FINDING AN E-MAIL MESSAGE TO WHICH ANOTHER
E-MAIL MESSAGE IS A RESPONSE 

Cross-references to related applications

This application claims the benefit of U.S.
Provisional Application No. 60/019264, filed June 7, 1996,
entitled "Finding an E-mail Message to Which Another E-mail
Message Is a Response."

Appendix 

An attached appendix (pages 25 - 80) has been provided
which lists the source code of the programs developed to
carry out the experiments described below in connection
with the present invention.

Copyright Notice 

A portion of the disclosure of this patent document
contains material which is subject to copyright protection.
The copyright owner has no objection to the facsimile
reproduction by anyone of the patent document or the patent
disclosure, as it appears in the Patent and Trademark
Office patent file or records, but otherwise reserves all
copyright rights whatsoever.

Technical Field 


This invention relates to electronic messaging and, 
more particularly, to a way of recognizing and manipulating 
threads contained in electronic messages.

Background of the Invention 

The volume of electronic messages, such as electronic
mail ("e-mail"), is huge and growing. Many users receive
more messages than they can handle, which has sparked
interest in better message handling software. Almost all
e-mail readers now support separating messages into
folders, and often allow rules to be defined to do this
automatically. Tools for prioritizing and searching
messages are also becoming available.

A problem with most such approaches is that they
process each message individually. Many messages are parts
of larger conversations, or threads. A thread is a
conversation among two or more participants carried out by
exchange of messages. Treating messages outside of this
context may lead to undesirable results. For instance, a
system that sorts messages into folders based on their
content is unlikely to be 100% accurate. The effectiveness
of content-based text categorization systems varies
considerably among categories, and accuracies over 95% are
rarely reported. This means that threads having as few as
20 component messages will almost always be broken up and
distributed into multiple folders by such a system, making
it difficult for a reader to follow the conversational
structure.
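The arithmetic behind this claim can be sketched as follows, under the simplifying assumption (not stated in the text) that each message is filed independently with the same per-message accuracy p. The probability that all 20 messages of a thread land in the correct folder is then p raised to the 20th power:

$$0.95^{20} \approx 0.36, \qquad 0.90^{20} \approx 0.12$$

Under that assumption, even at the rarely reported 95% accuracy a 20-message thread would be split roughly 64% of the time, and at 90% accuracy roughly 88% of the time.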

On the other hand, a mail reading interface that
understood threads could save users considerable effort.
For instance, some programs for reading Usenet news allow
users to delete an entire thread at once, greatly reducing
the number of messages the user must inspect.

Messaging systems that are explicitly oriented to
group discussion, e.g., the Usenet network and other
bulletin board systems, provide the most support for
threading. For instance, the reply command in most Usenet
news posting programs inserts into a reply or child message
two forms of information about the relationship between it
and its parent message (the message it is a reply to).
First, the chain of unique message identifiers in the
References: field of the parent is copied into the References:
field of the child, with the unique identifier of the
parent added. Second, the Subject: line of the parent is
copied into the Subject: line of the child, typically
prefixed by Re:. Some Usenet news readers that provide a
threaded display use the structural links from the
References: field, while others organize a threaded display
around Subject: lines which are identical or have identical
prefixes.

Conversations, including group discussions, can also
be carried out over electronic mail systems. The ability
to send to and reply to groups of people, as well as the
use of centralized mail "reflectors" and mailing list
management software, can informally support multiple large
scale discussions. As with bulletin board systems,
replying to an e-mail message often inserts structural
information into the reply. For Internet-based mail
systems, the reply command may copy the Message-Id: field, or
other identifying information from the parent, into the
In-Reply-To: field of the child. As in Usenet messages, the
Subject: line is typically copied to the Subject: field,
preceded by Re:.

Some mail clients provide threaded displays, though
this is less common than for bulletin board systems. For
instance, the VM mail reader (available at ftp.uu.net in
networking/mail/vm directory) allows grouping of messages
by one of several criteria, including having the same
subject line text, the same author, or the same recipient.
The mail archiving program hypermail (see
http://www.eit.com/software/hypermail.html) marks up
archives of e-mail with a variety of links, including
threading information. It attempts first to find a message
id in the In-Reply-To: field and match it to a known message.
Failing that, it looks for a matching date string in the
In-Reply-To: field, and finally tries for a match on the
Subject: line, after removing one Re: tag.
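The three-step fallback described above for hypermail can be sketched roughly as follows. This is only an illustration of the order of the checks, not hypermail's actual code: the dictionary keys (`message_id`, `date`, `subject`, `in_reply_to`) and the function names are hypothetical, and substring matching is an assumed simplification.

```python
import re


def strip_one_re(subject):
    """Remove a single leading 'Re:' tag, if one is present."""
    return re.sub(r"^\s*re:\s*", "", subject, count=1, flags=re.IGNORECASE).strip()


def find_parent(child, known):
    """Return the index in `known` of the likely parent of `child`, or None.

    `child` and each entry of `known` are plain dicts of header values
    (a hypothetical representation chosen for this sketch).
    """
    reply_to = child.get("in_reply_to", "")

    # Step 1: a known message id appearing in the In-Reply-To: field.
    for i, msg in enumerate(known):
        if msg.get("message_id") and msg["message_id"] in reply_to:
            return i

    # Step 2: a known date string appearing in the In-Reply-To: field.
    for i, msg in enumerate(known):
        if msg.get("date") and msg["date"] in reply_to:
            return i

    # Step 3: Subject: line match after removing one Re: tag from the child.
    wanted = strip_one_re(child.get("subject", ""))
    for i, msg in enumerate(known):
        if wanted and msg.get("subject", "").strip() == wanted:
            return i

    return None
```

Because only one Re: tag is removed in the last step, a subject such as "Re: Re: lunch" would not match a parent titled "lunch", which illustrates how fragile subject-based matching can be.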

However, the error rate of each of the above
approaches is considerable. While the References: field is
in theory required for replies to Usenet messages,
threading is hampered by clients that delete portions of
the References: chain due to limitations on field length.
In Internet electronic mail, the use of Message-Id: and
In-Reply-To: fields is optional, and their format and nature
are only loosely constrained when they are present. Subject:
lines for both Usenet messages and Internet mail are
allowed to contain arbitrary text, clients are inconsistent
in their use of Re: tags, and manual editing of Subject:
lines further confuses the issue. Furthermore, current
approaches to threading are to some extent misconceived, as
they rely upon rapidly changing conventions in software
communication.

While user clients typically insert in messages
structural information useful for recovering threads,
inconsistencies between clients, loose standards, creative
user behavior, and the subjective nature of conversation
make current threading systems only partially successful,
and the situation is unlikely to change.

One approach to dealing with the above situation is to
try to force clients to follow tighter standards for
specifying threads. However, such an approach does not
appear practical in light of the increasing diversity of
clients and the growing interconnection of only partially
compatible messaging systems. Tighter standards also do
not help in recovering thread structure from archived
messages, since deletion of fields such as In-Reply-To: by
archiving and digestifying programs is common.

It is also not clear that threads should be identified
with trees of reply links. The reply command is often used
to avoid retyping a mail address, rather than to continue a
conversation. Further, users will disagree about what is
on-topic in a thread, and off-topic responses can easily
spawn subdiscussions. Conversely, on-topic contributors to
a discussion may simply send a message rather than using
the reply command.

This suggests that the links desired for display in a
threading interface, and which result in structures to be
processed as a unit, are actually not objectively defined
"pattern-matching" or "structural" links. The link desired
to be captured is that of a response in an ongoing
discourse. The fact that users are able to participate in
online discussions, despite the inadequacies of current
threading software, suggests that most messages contain the
contextual information needed to understand their place in an
ongoing conversation. Thus it is at least possible that an
automated system will be able to make use of this
information as well, to make this conversational structure
explicit as a thread.

The role of cohesion or linking between the parts of a
dialogue has been recognized. Language provides a variety
of mechanisms for achieving this cohesion. One such
mechanism is lexical cohesion and in particular lexical
repetition, that is, the repeating of words in linked parts
of a discourse.

The phenomenon of lexical repetition suggests that the
similarity of the vocabulary between two messages should be
a powerful clue to whether a response relationship exists
between them. Measuring the similarity of vocabulary
between texts is, of course, a widely used strategy for
finding texts with similar topic to a query. Indeed,
similarity-based methods have been used to construct
hypertexts linking documents or passages of documents on
the basis of topic similarity.

Attempts have also been made to go beyond unlabeled
linking to use similarity matching in detecting discourse
relations. Hearst's TextTiling algorithm (see M.A.
Hearst, "Multi-paragraph Segmentation of Expository Text,"
32nd Annual Meeting of the Association for Computational
Linguistics at pp. 9-16, Las Cruces, NM, June 27-30, 1994)
uses vector space similarity to decompose a text into
topically coherent segments. The graph structure of a
network of raw similarity links has also been used to infer
meta-links corresponding to discourse relations such as
comparison and summarization (see J. Allan, "Automatic
Hypertext Link Typing," Proceedings of Hypertext '96,
1996). These lines of evidence suggest text similarity
could be a clue to the existence of a response relation
between messages as well.

What is desired is a way to utilize robust conventions
in human communication in place of, or in addition to,
software conventions in order to produce an effective
message threading system.



Summary of the Invention 

An object of the present invention is to utilize the
textual content and characteristics of messages to provide
a more reliable and effective way to construct message
threads. In accordance with the present invention,
statistical information retrieval techniques are used in
conjunction with textual material obtained by "filtering"
of messages to achieve a significant level of accuracy at
identifying when one message is a reply to another.

Brief Description of the Drawings

FIGURE 1 shows the results of experimentation for a
matching strategy used in an embodiment of the present
invention.

FIGURE 2 contains a diagram showing an embodiment of
the present invention.

FIGURE 3 contains a diagram showing a more generalized
embodiment of the present invention.



Detailed Description 

Threading of electronic messages should be treated as
a language processing task. The present invention utilizes
textual context and characteristics of messages in order to
provide a more reliable and effective way to construct
message threads. Preliminary experiments show that a
significant level of threading effectiveness can be
achieved by applying standard text matching methods from
information retrieval techniques to the textual portions of
messages. In accordance with the present invention,
statistical information retrieval techniques are used in
conjunction with textual material obtained by "filtering"
of messages to achieve a significant level of accuracy at
identifying when one message is a reply to another. A
preferred embodiment of the present invention will now be
described with reference to the experiments described
below. The experiments are meant to be illustrative of the
process of the present invention and are not intended to be
limiting.

Experiments

The goal in experimentation was to test the ability of
various linguistic clues to indicate whether one message
was a response to another. Three types of textual material
from messages were investigated: (1) the Subject: line; (2)
quoted material in the message; and (3) the (unquoted) text
of the message itself. The results of the experiments
conducted show that statistical information retrieval
techniques can achieve a significant level of accuracy at
identifying when one message is a reply to another.

Text from the Subject: line is a good clue that a
message belongs to a particular thread, though it may not
directly indicate which message in the thread is being
replied to. Quoting of material from the parent message,
particularly quotes of several lines, is a much stronger
form of context. Salton and Buckley, in an article entitled
"Global Text Matching for Information Retrieval," Science,
253:1012-1015 (August, 1991), showed that text matching on
a collection of Usenet messages which included substantial
quoted material was highly effective at retrieving related
messages, under a definition of relatedness that subsumed
the response relationship of interest.

Further, the actual text of the reply can be expected,
based on the coherence phenomena described earlier, to
repeat words from the parent message. Since new material
will be present as well, this is expected to be a
somewhat weaker clue than the Subject: line and quoted text.

a. Data and Preparation

A corpus of 2435 messages posted to the www-talk
mailing list during the period February 1994 through July
1994 was obtained from the archives at URL
http://www.w3.org/hypertext/WWW/Archive/www-talk.
A total of 941 of these messages had an In-Reply-To: field
containing a unique identifier from the Message-Id: field of
another message in the corpus. While it is suggested
herein that In-Reply-To: links will not always correspond to
the discourse response links of interest, they provide a
reasonable initial test of the ability of text matching to
find connections that are response-like. Therefore, these
941 child-parent pairs were used as ground truth against
which methods for finding parent messages were tested.

Simple message filters were written to extract the
three types of textual material (referred to above) from
each message: (1) the text of the Subject: field; (2)
unquoted text from the message body; and (3) quoted text
from the message body. This resulted in three collections
of 2435 document representatives, one for each type of
textual material. Some messages had empty document
representatives in some of the databases (for instance, a
message might have no quoted material) and so could not be
retrieved from that database. These messages were used as
"target" messages for the matching strategies described
herein. Target messages represented the potential parent
messages matched against a given "query" (child) message
chosen from the database. The "best" match of the target
messages (excluding the query message) for a given query
message represents a potential parent message.
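A minimal sketch of such a message filter is given below. It is not the code from the appendix; it assumes a raw message whose headers and body are separated by a blank line, and it assumes that quoted lines are marked with a leading ">" (other quoting conventions exist). The function name and the returned dictionary keys are hypothetical.

```python
def filter_message(raw):
    """Split a raw message into Subject:, quoted, and unquoted text.

    Assumes headers and body are separated by the first blank line and
    that quoted lines start with '>' (an assumed convention).
    """
    header, _, body = raw.partition("\n\n")

    subject = ""
    for line in header.splitlines():
        if line.lower().startswith("subject:"):
            subject = line[len("subject:"):].strip()
            break

    quoted, unquoted = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Drop the quote prefix (any run of '>' and spaces).
            quoted.append(stripped.lstrip("> ").rstrip())
        elif stripped:
            unquoted.append(line.strip())

    return {
        "subject": subject,
        "quoted": "\n".join(quoted),
        "unquoted": "\n".join(unquoted),
    }
```

A message with no quoted lines yields an empty "quoted" entry, which corresponds to the empty document representatives mentioned above.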

Each of the three collections was indexed using
Version 11.0 of the SMART experimental text retrieval
system, obtained June 13, 1995 from directory pub/smart at
ftp.cs.cornell.edu. The SMART text retrieval system uses
statistical information retrieval techniques to rank target
messages using the cosine similarity formula and a
variant of tf x idf weighting. Using the SMART system,
target messages were represented as vectors of numeric
weights:

$$\langle w_{i1}, w_{i2}, \ldots, w_{ik}, \ldots, w_{it} \rangle$$

where

[equation for $w_{ik}$ not legible in source]

and $f_{ik}$ is the number of times word $k$ appears in message
$i$. Query messages were similarly represented as vectors:

$$\langle q_1, q_2, \ldots, q_k, \ldots, q_t \rangle$$

where

$$q_k = f_k \times \log(N / n_k)$$

Here $f_k$ is the number of times the word occurs in the query
message, $N$ is the number of messages in the database, and
$n_k$ is the number of messages containing word $k$. SMART
scores each target message $i$ as

[scoring equation not legible in source]
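The ranking step can be illustrated with a small sketch. This is not the SMART system and does not reproduce its exact weighting variant: it assumes raw term frequencies for target messages, query weights of the form f_k x log(N/n_k), and cosine normalization of both vectors. The tokenizer and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rank_targets(query_text, target_texts):
    """Return (target index, score) pairs, best first.

    Targets sharing no weighted word with the query are not returned.
    """
    docs = [Counter(tokenize(t)) for t in target_texts]
    n_docs = len(docs)
    # n_k: number of target messages containing word k.
    doc_freq = Counter(word for d in docs for word in d)

    # Query weights f_k * log(N / n_k); words absent from all targets are dropped.
    q_tf = Counter(tokenize(query_text))
    q = {w: f * math.log(n_docs / doc_freq[w]) for w, f in q_tf.items() if w in doc_freq}
    q_norm = math.sqrt(sum(v * v for v in q.values()))

    scores = []
    for i, d in enumerate(docs):
        dot = sum(weight * d[w] for w, weight in q.items())
        if dot > 0:
            d_norm = math.sqrt(sum(f * f for f in d.values()))
            scores.append((i, dot / (q_norm * d_norm)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For example, ranking the query "cat on a mat" against the three targets "the cat sat on the mat", "dogs chase cats", and "stock prices fell today" returns only the first target, since the other two share no words with the query.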



b. Processing

Five text matching strategies were tested in the
experiments for their ability to retrieve the parent of a
message, given text from the child message. For each
strategy, all 941 document representatives of identified
child messages were run as queries against one of the three
databases of 2435 document representatives using the SMART
system. This produced a ranking of all 2435 target (that
is, potential parent) messages for each query message.
Messages which did not have any words in common with the
query were not retrieved. They were assigned random ranks
lower than that of any retrieved message. Documents were
ranked by the score assigned by the SMART system
processing. The code developed for carrying out the
processing, message filtering and matching (with the
exception of the SMART program which, as noted, was
obtained from a publicly-available source) is included in
the appendix (pages 25 - 80), which is filed herewith and
expressly incorporated by reference herein.

Each strategy was a choice of what text from a child
should be used as a query (i.e., what type of message
filter to use for a child message), and what text from
target messages (i.e., what type of message filter) should
be used to represent them in the database. The five
combinations explored were:

Queries          Targets
Subject text     Subject text
Unquoted text    Unquoted text
Unquoted text    Quoted text
Quoted text      Unquoted text
Quoted text      Quoted text

c. Experimental results

FIGURE 1 displays the distribution of ranks of the 941
parent documents with respect to each of the five forms of
text matching. The value for rank 0 is the number of times
a child retrieved its parent as the first document in the
ranking, rank 1 indicates how often the parent was second
in the ranking, and so on. In computing the rank of the
parent, the child document (which was itself present in the
database, though not necessarily in the same form as was
used in querying) was removed from the ranking, so that the
ranks run from 0 to 2433 instead of 0 to 2434.

Table 1 below shows the number of times the parent was
retrieved at rank 0, ranks 0 to 4, and ranks 0 to 9 for
each of the search strategies used in the experimentation,
over 941 trials. These counts are compared to the values
that would be expected if the parent appeared at a random
rank between 0 and 2433.



Ranks   Random   Subj-    Unquot-   Unquot-   Quot-     Quot-
                 Subj     Unquot    Quot      Unquot    Quot
0       0.39     119      131       40        666       150
0-4     1.93     446      303       161       745       319
0-9     3.87     639      418       210       759       368

Table 1: Parents retrieved for each search strategy
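The Random column of Table 1 can be checked with a few lines of arithmetic. The sketch below assumes that the parent is equally likely to fall at any of the 2434 possible ranks (0 to 2433), so the expected number of hits in the top k ranks over 941 trials is 941 x k / 2434. The function name is hypothetical.

```python
def expected_random_hits(trials, n_ranks, top_k):
    """Expected number of trials in which a uniformly random rank falls in the top_k."""
    return trials * top_k / n_ranks


# Ranks 0, 0-4 and 0-9 correspond to top_k = 1, 5 and 10.
for top_k in (1, 5, 10):
    print(top_k, round(expected_random_hits(941, 2434, top_k), 2))
# prints:
# 1 0.39
# 5 1.93
# 10 3.87
```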
Discussion

As expected, using the quoted portion of a message as
a query (i.e., child message filter extracts quoted text
portion) and matching against the unquoted portions of
target messages (i.e., target message filter extracts
unquoted text) was the most effective strategy, of the five
strategies tried, for finding a parent message. As shown
in Table 1, the parent was the highest ranked message in
666 out of 941 trials, or 71% of the time (for the quoted
query - unquoted target strategy). Put another way, a
system that simply assumed the highest ranked message under
this matching strategy was the parent would, on average,
have 0.71 recall (i.e., retrieval of 71% of the items
relevant to the query message) and 0.71 precision (i.e.,
71% of the retrieved items are relevant to the query
message) at finding parent messages. Of course, these
results are for messages that are known to have a parent
message. An operational system would need not only to
distinguish among potential parents, but also to detect
whether or not the message has a parent at all. One way of
accomplishing this is to establish a threshold (which may
be preset or specified by a user) against which the
ranking or similarity scores for the child and potential
parent messages would be measured. If the highest ranking
or similarity score falls below the threshold, then it
would be determined that there is no "match", i.e., no true
parent message for that child message.
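The threshold test described above might be sketched as follows. The function name, the dictionary of scores keyed by target message identifier, and the example threshold values are all hypothetical; the text does not prescribe a particular threshold.

```python
def choose_parent(scores, threshold):
    """Return the id of the best-scoring target, or None if there is no match.

    `scores` maps target message ids to similarity scores. A best score
    below `threshold` is treated as "no true parent".
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

With scores of 0.2 and 0.8 for two candidate messages, a threshold of 0.5 selects the second candidate, while a threshold of 0.9 reports that the message has no parent.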

These results can be roughly compared with the 0.90
recall and 0.72 precision in Salton and Buckley's
experiments with Usenet messages containing quoted
material. However, Salton and Buckley were attempting to
find related messages, not just parent messages, and
defined all messages with the same Subject: line as being
related. The task undertaken by Salton and Buckley is a
simpler task than finding the single parent of a message.

Referring again to FIGURE 1, it is apparent that the
other strategies tried were not as effective as matching
quoted text against unquoted targets, though all were far
better than random at finding parent messages. Even
matching unquoted text queries against quoted text targets,
which preferentially retrieves the children of a message,
returns a nontrivial number of parents based on general
content similarity. Similarly, matching quoted queries
against quoted targets should mostly find siblings of a
message, but finds some parents due to nested quotations
that persist to the child.

How fast the number of parents gained drops off with
increasing rank also depends on the matching strategy. As
shown in FIGURE 1, the smoothest decay comes from matching
unquoted material against unquoted material (the fourth
curve in FIGURE 1). This picks up parents based on a
general similarity of content rather than repetition of
actual text from the parent. The relatively smooth
gradation of content similarity which shows up in typical
text retrieval systems also shows up here. In contrast,
the curve for quoted queries vs. unquoted messages drops
off extremely sharply. In most cases only the single
parent message will have a large block of unquoted text
similar to the quoted text of the child. The curve for
subject vs. subject (the fifth curve in FIGURE 1) drops
sharply at the beginning, after the exhausting of those
cases where there are nearly exact matches between the
Subject: line of the query and a few documents with the same
Subject: line. Later the curve is more gradual, reflecting
cases where the subject line is common to many messages, or
the match is on only a subset of the words.

The diagram in FIGURE 2 shows the flow of message
processing in accordance with the present invention. At
200 is a set of N target messages (denoted 1, 2, ..., N),
any of which may be a parent message to be determined.
Each target (potential parent) message at 200 is filtered
through a parent message filter A at 210. As seen from the
experiments described above, parent message filter A may
extract subject text, unquoted text, or quoted text from
each message. The result of the message filtering
operation is a set of filtered target (potential parent)
messages (denoted 1_A, 2_A, ..., N_A) at 220. Preferably,
based upon the above test results, message filter A at 210
extracts unquoted text from each potential parent, and the
set of unquoted text messages for potential parents is at
220.

Continuing, the filtered potential parent messages
(1_A, 2_A, ..., N_A) at 220 are then passed along to a
Statistical Information Retrieval Function at 230.
Statistical Information Retrieval Function 230 can be the
SMART system described above or an equivalent
statistically-based retrieval function.

The child, or reply, message CM at 240 is also
processed using a message filter Q at 250. As discussed
above, the child message filter may extract subject text,
unquoted text, or quoted text from the child message,
producing a filtered child message CM_Q at 260. Preferably,
based upon the experiments described above, the child
message filter at 250 extracts quoted text from the child
message CM at 240, producing child quoted text at 260.

The filtered child message CM_Q is then passed to the
Statistical Information Retrieval Function at 230, along
with filtered parent messages (1_A, 2_A, ..., N_A). The
Statistical Information Retrieval Function processes these
message components to provide a similarity value table at
270, which represents values (denoted AQ_1, AQ_2, ..., AQ_N),
each of which is a measure of how likely it is that the
corresponding message (1, 2, ..., N) is the parent for the
child message CM.
To determine the most likely parent message, the
similarity value table at 270 is processed by a maximum
value function at 280 from which the maximum value can be
determined. The position in the table of the maximum value
is a pointer or identifier at 290 that can be used to
retrieve the corresponding target message which has been
selected as the most likely parent message. This message
can now be presented to the user along with the child
message in a variety of formats, or simply retained for
further processing to produce a thread. Alternatively, a
list of potential message pairings, with or without
selecting which one is the actual parent, may be
presented to the user.
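The flow of FIGURE 2 can be outlined in code as a sketch under stated assumptions. The filters and the similarity function are passed in as parameters; the toy word-overlap similarity and the ">"-based filters below are illustrative stand-ins (not the SMART system or the appendix code), and all names are hypothetical.

```python
def most_likely_parent(child, targets, child_filter, target_filter, similarity):
    """Return (index of most likely parent, similarity value table).

    Mirrors FIGURE 2: filter Q on the child (250), filter A on each
    target (210), a retrieval function (230) filling the table (270),
    and a maximum value function (280) yielding the pointer (290).
    `targets` must be non-empty.
    """
    filtered_child = child_filter(child)
    filtered_targets = [target_filter(t) for t in targets]
    table = [similarity(filtered_child, ft) for ft in filtered_targets]
    best = max(range(len(table)), key=table.__getitem__)
    return best, table


# Toy stand-ins for illustration only.
def word_overlap(a, b):
    """Count distinct lowercase words shared by two texts."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def quoted_text(message):
    """Keep only lines starting with '>', with the prefix removed."""
    return " ".join(l.lstrip("> ") for l in message.splitlines() if l.startswith(">"))


def unquoted_text(message):
    """Keep only lines that do not start with '>'."""
    return " ".join(l for l in message.splitlines() if not l.startswith(">"))
```

Calling `most_likely_parent(child, targets, quoted_text, unquoted_text, word_overlap)` applies the preferred quoted-query, unquoted-target combination; swapping the filter arguments gives the other combinations tested in the experiments.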

As mentioned above, an alternative step may include
establishing a threshold against which the ranking or
similarity scores for the child and potential parent
messages are measured, and if none of the rankings or
similarity scores exceed the threshold, then it would be
determined that there is no "match", i.e., no true parent
message for that child message.

Generating a thread may be accomplished by iteratively
applying the method of the present invention as described
above. Starting with a perceived child message, a likely
parent message is determined using the method. That parent
message is then substituted as a new "child" message and
its parent (i.e., the grandparent of the original child
message) is determined using the same method. Similarly,
the grandparent message can then be substituted as yet
another "child" message to determine its parent and so
forth, so that ultimately a thread of messages having a
parent-child relationship between successive messages may
be obtained.
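The iterative procedure above can be sketched as a short loop. The `find_parent` argument stands for any parent-finding method, such as the one described in connection with FIGURE 2; the function name, the `max_depth` limit, and the cycle check are assumptions added for this sketch.

```python
def trace_thread(start, find_parent, max_depth=100):
    """Follow parent links upward from `start`; return the chain oldest first.

    `find_parent(message_id)` returns the likely parent id or None.
    Stops on a missing parent, on a repeated message (a cycle), or
    when the chain reaches `max_depth` messages.
    """
    thread = [start]
    seen = {start}
    current = start
    while len(thread) < max_depth:
        parent = find_parent(current)
        if parent is None or parent in seen:
            break
        thread.append(parent)
        seen.add(parent)
        current = parent
    return list(reversed(thread))
```

For instance, if "B" is found to be the parent of "C" and "A" the parent of "B", tracing from "C" yields the chain A, B, C.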

Another way to generate a thread of messages is to
process all messages as child messages against all other
messages as potential parent messages (which, in fact, is
the technique utilized during experimentation). For each
child message, its parent is determined as described above
using a statistical information retrieval function and
computing similarity values. Threads can be determined by
linking up successive child-parent pairs. Linking of
successive child-parent pairs may be done by, for example,
finding a child message (denoted "B") having a parent
message (denoted "A") wherein child message "B" is itself
a parent message for another child message (denoted "C");
that is, message "A" is the parent of "B" and the
grandparent of "C". Thus, the link of messages would be
"A" - "B" - "C", and so on until all messages in the thread
are accounted for.
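Linking child-parent pairs into threads, as described above, might look like the following sketch. It assumes the pairs are available as a dictionary mapping each child identifier to its chosen parent; the function name and the depth-first ordering are assumptions of this illustration.

```python
from collections import defaultdict


def build_threads(parent_of):
    """Group messages into threads from a child -> parent mapping.

    Each thread is a list of message ids in depth-first order starting
    at a root (a message that is a parent but has no parent itself).
    Groups of pairs that form a cycle have no root and are skipped.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    roots = sorted(p for p in children if p not in parent_of)
    threads = []
    for root in roots:
        order, stack = [], [root]
        while stack:
            node = stack.pop()
            order.append(node)
            # Push children in reverse so they are visited in sorted order.
            stack.extend(sorted(children.get(node, []), reverse=True))
        threads.append(order)
    return threads
```

Given the pairs B-A, C-B and E-D (child first), this produces the two threads A, B, C and D, E.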

An alternative to the embodiment of the present
invention described above may be used to obtain a likely
child message given a parent message. The basic process
using message filters is the same for the alternative
embodiment. The differences in the process are the filters
used. For example, in the experiments described above, the
best results in determining a parent message given a child
message were obtained by using a quoted text filter for the
child and an unquoted filter for each of the potential
parent messages. Starting with a given parent message,
then, the process would involve the use of an unquoted
filter on the parent message and a quoted filter for each
of the remaining messages (the potential child messages).
Once the messages are filtered, the processing essentially
takes place as described above.

It is readily apparent that one way of utilizing the
present invention is with batch processing of messages such
as, e.g., would be done in connection with message
archiving. Another way of utilizing the method of the
present invention, however, is in the processing of
incoming messages as they arrive, rather than waiting for a
batch to accumulate. For example, when a new message
arrives, the method of the present invention could be
applied to identify a parent message from the messages that
have previously arrived. In addition, in the event that
the messages are received out of order, the new message
could be checked against the other messages (in accordance
with the method described above for locating a child
message from a potential parent) in order to determine a
child message for the newly received message.

A variety of improvements in the basic processing
scheme described above are possible. By improving
processing of document text, as well as making use of
additional evidence, it is believed that the above results
can be greatly improved. The improvements, each of which
might be viewed as a message "filter," are as follows.

(1) Better Text Representation. The above-described
experiments ignored the order of words when matching query
messages against potential parents. This is sensible for
detecting similarity of topic, as is the goal in matching
unquoted text against unquoted text. A quotation in a
child message, however, is likely to repeat a long sequence
of words from the parent. Indexing, matching, and term
weighting based on multi-word phrases or entire lines
should greatly reduce the number and strength of spurious
matches. Since header material (From: lines, etc.) can
appear in quotes as well, matching should be allowed on
this material as well.
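One simple form of matching on entire lines, as suggested in item (1), is to count how many quoted lines of the child appear verbatim among the unquoted lines of a candidate parent. The sketch below is only one possible reading of that suggestion; the normalization (lowercasing and collapsing whitespace) and the function names are assumptions.

```python
def normalize(line):
    """Lowercase a line and collapse runs of whitespace."""
    return " ".join(line.lower().split())


def shared_line_count(child_quoted, target_unquoted):
    """Count non-empty quoted lines of the child found in the target's unquoted text."""
    target_lines = {normalize(l) for l in target_unquoted.splitlines() if l.strip()}
    return sum(
        1
        for l in child_quoted.splitlines()
        if l.strip() and normalize(l) in target_lines
    )
```

A count of this kind could serve as an additional score alongside word-based similarity, since a whole-line match is much less likely to arise by chance than a shared word.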

(2) Nested Quotation. Multiple levels of quotation
are common in electronic messaging, and are indicated by
concatenated prefixes. For instance, if textual material
is prefixed by ">>>", it would be expected that the parent
message has the material prefixed by ">>", or perhaps by
">", but probably not by nothing and certainly not by "|"
or "*". Concatenated Re: tags appear in Subject: lines, but
should be statistically characterized, since their use by
mailers is erratic.
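The nesting level of a quoted line can be measured by counting the ">" characters in its prefix, as in the sketch below. It assumes ">" is the only quote marker and that markers may be separated by spaces; the function names are hypothetical.

```python
import re


def quote_depth(line):
    """Count the '>' markers in the leading quote prefix of a line."""
    prefix = re.match(r"[>\s]*", line).group(0)
    return prefix.count(">")


def expected_parent_depth(child_depth):
    """Depth at which the same material would most likely appear in the parent."""
    return max(child_depth - 1, 0)
```

Material at depth 3 in a child (">>>") would thus be expected at depth 2 (">>") in its parent, and material quoted once in the child would be expected to appear unquoted in the parent.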

(3) Time. Most replies to a message occur within a
window of a few days after the message is posted. A simple
statistical model, perhaps similar to those used in
analyzing citation patterns, can be used to take this
tendency into account.
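As one illustration of how a time model could enter the scoring, the sketch below weights a candidate parent by an exponential decay in the delay between the two messages. The exponential form and the two-day half-life are purely assumptions made for this example; the text does not specify a model or any parameter values.

```python
from datetime import datetime


def time_weight(parent_time, child_time, half_life_days=2.0):
    """Weight in [0, 1] that halves every `half_life_days` of reply delay.

    Returns 0.0 when the candidate parent was posted after the child.
    """
    delay_days = (child_time - parent_time).total_seconds() / 86400
    if delay_days < 0:
        return 0.0
    return 0.5 ** (delay_days / half_life_days)


# A reply posted two days after its candidate parent gets weight 0.5.
weight = time_weight(datetime(1994, 6, 1), datetime(1994, 6, 3))
```

Such a weight could be multiplied into, or otherwise combined with, the text similarity score for each candidate parent.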

(4) Recognizing Other Message Relationships.
Duplicated, bounced, reposted, continued, and revised
messages have strong textual similarity to other messages.
The experimental data showed cases where they were falsely
construed as replies. If treated simply as nonreplies they
are likely to distort statistical models distinguishing
replies from nonreplies. A better approach is to model
these other message relationships as well, both to
distinguish them from response relationships and to provide
additional useful links between messages. For instance, a
mail reader might display a revised message while
backgrounding the original.

(5) Authorship Information. Replies often refer to
the author of the parent message, either in an
automatically produced fashion, such as:

lewis@research.att.com (David D. Lewis) writes:
>I'd really like a threading email reader.

or via a manually written salutation (e.g., "Dear Susan").
These may be matched against header information of messages
and manually or automatically produced signatures.

(6) Cue Phrases. In responses which do not directly
quote the parent message, the author will often use
linguistic cues to indicate the parent message, e.g., "I
really like the suggestion that..." or "Your argument is...".
Considerable research has been done on distinguishing what
relationship a particular cue phrase indicates, and that
research can be applied here.

(7) Message Categorization. Certain types of messages
such as calls for papers and job ads are unlikely to be
replies to other messages and/or are unlikely to be replied
to publicly. Known text categorization methods can detect
these and provide evidence against the presence of response
links.

(8) Detection of Siblings. A message without a clear
connection to its parent may be similar to another child of
the same parent, which does have a clear link. For
instance, two people may post similar responses objecting
to an error in the parent message, but only one uses the
reply command.
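
Improvements (1) and (2) can be illustrated together. The following is only a minimal sketch under stated assumptions, not the method used in the experiments: it matches whole quoted lines rather than individual words, and it expects a line quoted at depth d in the child to appear at depth d-1 in the parent. The prefix pattern and the names QUOTE_PREFIX, split_quote and quoted_line_evidence are hypothetical and were chosen for illustration.

```python
import re

# Assumed prefix pattern: optional initials followed by '>' marks, or runs of '|'.
QUOTE_PREFIX = re.compile(r"^\s*((?:[A-Za-z]*>+|\|+)\s*)+")


def split_quote(line):
    """Return (depth, text), where depth counts the '>' and '|' marks in the prefix."""
    m = QUOTE_PREFIX.match(line)
    if not m:
        return 0, line.strip()
    prefix = m.group(0)
    depth = prefix.count(">") + prefix.count("|")
    return depth, line[m.end():].strip()


def quoted_line_evidence(child_lines, parent_lines):
    """Fraction of the child's quoted lines found in the parent one level shallower."""
    parent = {}
    for line in parent_lines:
        depth, text = split_quote(line)
        if text:
            parent.setdefault(text, set()).add(depth)
    quoted = [(d, t) for d, t in map(split_quote, child_lines) if d > 0 and t]
    if not quoted:
        return 0.0
    hits = sum(1 for d, t in quoted if d - 1 in parent.get(t, ()))
    return hits / len(quoted)


parent_msg = ["> old point", "new point from parent"]
child_msg = [">> old point", "> new point from parent", "my reply"]
print(quoted_line_evidence(child_msg, parent_msg))
```

A value near 1 suggests that the candidate is the message being quoted. A value near 0 is weak evidence against it, since a reply need not quote its parent at all.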

All of the above improvements are, in effect, clues
that provide evidence toward the presence or absence of
response links, but in all cases this evidence is
uncertain. A planned strategy is to implement the clues so
as to reduce their uncertainty as much as is reasonable,
but then to rely on machine learning methods known to those
skilled in the art to combine these multiple uncertain
clues into a decision procedure. This approach to complex
information retrieval problems allows the system
implementer to focus on the relatively clean task of
building feature detectors, while letting a learning
algorithm use training data to balance the uncertain
relationship of those features to the property of interest.
(Two articles provide good examples of this strategy: B.
Croft, J. Callan & J. Broglio, "TREC-2 Routing and Ad-hoc
Retrieval Evaluation Using the INQUERY System," in The
Second Text Retrieval Conference (D.K. Harman, ed.,
Gaithersburg, MD, March 1994, U.S. Dept. of Commerce,
National Institute of Standards and Technology (NIST)
Special Publication 500-215) pp. 75-83; and E. Spertus,
"Smokey: Automatic Flame Recognition," Manuscript, Computer
Science Department, Massachusetts Institute of Technology,
1996, submitted to ACM SIGIR '96.) In addition, this
approach allows the system to be tailored to user
preferences as expressed, for instance, through their
overriding of system decisions. This is desirable, since
the presence of a response link is to some degree
subjective.
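
One way such a learned decision procedure might look is sketched below. It is an illustration only: logistic regression fitted by stochastic gradient descent is one of many suitable learning methods, and the text above does not prescribe it. The three clue values per example (subject, quoted-text and unquoted-text similarity) and the tiny training set are invented for the example, and train_combiner and reply_probability are hypothetical names.

```python
import math


def train_combiner(examples, epochs=200, rate=0.5):
    """Fit logistic-regression weights to (clue_vector, is_reply) pairs."""
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for clues, label in examples:
            z = bias + sum(w * c for w, c in zip(weights, clues))
            error = label - 1.0 / (1.0 + math.exp(-z))
            bias += rate * error
            weights = [w + rate * error * c for w, c in zip(weights, clues)]
    return weights, bias


def reply_probability(weights, bias, clues):
    """Combine uncertain clue values into one estimate that a response link exists."""
    z = bias + sum(w * c for w, c in zip(weights, clues))
    return 1.0 / (1.0 + math.exp(-z))


# Invented training data: (subject, quoted, unquoted) similarity and a 0/1 label.
training = [
    ([0.9, 0.8, 0.3], 1), ([0.7, 0.9, 0.2], 1), ([0.8, 0.0, 0.4], 1),
    ([0.1, 0.0, 0.2], 0), ([0.2, 0.1, 0.1], 0), ([0.0, 0.0, 0.3], 0),
]
weights, bias = train_combiner(training)
print(reply_probability(weights, bias, [0.9, 0.9, 0.3]))
```

User corrections of system decisions could be appended to the training pairs and the weights refitted, which is one way the tailoring to user preferences described above might be realized.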

Each of the above-referenced improvements may be
utilized as message filters alone or in combination with
one another and with the "subject text," "quoted text" and
"unquoted text" message filters that were the subject of
the experiments described herein. Accordingly, an
embodiment of the present invention may be obtained as a
generalization of the embodiment reflected in FIGURE 2
described above. With reference to the diagram in FIGURE
3, the flow of message processing for the more general
embodiment of the present invention will now be described.

As shown in FIGURE 3, at 300 is a set of N target
messages (denoted 1, 2, ..., N), any of which may be a
parent message to be determined. Each target (potential
parent) message at 300 is filtered through a parent message
filter bank (which may be one or more message filters).
The parent message filter bank is shown at 310 in FIGURE 3
as a set of one or more message filters denoted by A, B,
..., K, giving a parent message filter bank of length K.
Parent message filters A through K may extract subject
text, unquoted text, or quoted text from each message, or
they may implement one or more of the "improvements" in
message analysis described above (such as, e.g., extracting
nested quotations, time information, or cue phrases). The
result of the filtering operation is a set of N filtered
target (potential parent) message vectors (denoted 1_A, 1_B,
..., 1_K, 2_A, 2_B, ..., 2_K, ..., N_A, N_B, ..., N_K) at 320,
where each filtered parent message is a vector consisting of
the K filtered representations of the message, i.e., each
element of the vector is the result of one of the K
filtering operations (e.g., filtered target message 1 is
denoted as vector 1_A, 1_B, ..., 1_K, where 1_A represents the
result of processing target message 1 through message
filter A, etc.). These filtered potential parent messages
at 320 are then passed along to the Statistical Information
Retrieval Function at 330, which may be the SMART system
described above or an equivalent statistically-based
retrieval function.

The child, or reply, message CM at 340 is also
processed using a message filter bank (which may be one or
more message filters). In FIGURE 3, the child message
filter bank is shown at 350 as a set of message filters
denoted as Q, R, ..., Z, giving a child message filter bank
of length Z-Q+1. The child message filter bank may contain
one or more of the same type of potential message filters
described above for the parent message filter bank. The
child message filter bank produces a filtered child message
vector (denoted CM_Q, CM_R, ..., CM_Z) containing Z-Q+1
filtered representations of the message at 360.

The filtered child message vector (CM_Q, CM_R, ..., CM_Z)
is then passed to the Statistical Information Retrieval
Function at 330, along with the set of filtered parent
message vectors (1_A, 1_B, ..., 1_K, 2_A, 2_B, ..., 2_K, ...,
N_A, N_B, ..., N_K). The Statistical Information Retrieval
Function processes these message components to provide a
similarity value table at 370, with values (denoted AQ_1,
AQ_2, ..., AQ_N, ..., KZ_1, KZ_2, ..., KZ_N) representative of
the similarity between potential parent and child message
components. It may be preferable to combine the columns of
values in the similarity value table of 370 using a
combiner function at 372 to provide a single tuple of
values at 374, each element of which is a measure of how
likely it is that the corresponding message (1, 2, ..., N)
is the parent for the child message CM. As discussed
above, the combiner function may be a decision procedure
based upon machine learning methods. To determine the most
likely parent message, the tuple of values at 374 is
processed by a selector function at 380 from which an
identifier for the most likely parent message can be
determined at 390. For example, if the selector function
is the maximum value function described above with
reference to FIGURE 2, the position of the maximum value in
the tuple of values is a pointer or identifier at 390 that
can be used to retrieve the corresponding target message
which has been selected as the most likely parent message.
The selected message can now be presented to the user
along with the child message in a variety of formats, or
simply retained for further processing to produce a thread.
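
The flow of FIGURE 3 can be sketched compactly as follows. This is a hedged illustration, not the appendix software: a plain bag-of-words cosine stands in for the SMART retrieval function, the combiner is an unweighted sum, and the selector is the maximum. The message dictionaries, the three filters, the choice of which child filter is paired with which parent filter, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def words(text):
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def subject_filter(msg):
    return re.sub(r"^(re:\s*)+", "", msg["subject"], flags=re.I)


def quoted_filter(msg):
    return " ".join(l.lstrip("> ") for l in msg["body"].splitlines() if l.startswith(">"))


def unquoted_filter(msg):
    return " ".join(l for l in msg["body"].splitlines() if not l.startswith(">"))


# Assumed (child filter, parent filter) pairs; each pair gives one row of the table.
FILTER_PAIRS = [
    (subject_filter, subject_filter),
    (quoted_filter, unquoted_filter),
    (unquoted_filter, unquoted_filter),
]


def similarity_table(child, targets):
    """Similarity value table (370): one row per filter pair, one entry per target."""
    return [[cosine(words(cf(child)), words(pf(t))) for t in targets]
            for cf, pf in FILTER_PAIRS]


def combine(table):
    """Combiner function (372): one score per target message."""
    return [sum(scores) for scores in zip(*table)]


def select_parent(scores):
    """Selector function (380): position of the maximum value."""
    return max(range(len(scores)), key=scores.__getitem__)


targets = [
    {"subject": "Lunch plans", "body": "Shall we get pizza on Friday?"},
    {"subject": "Threading email readers", "body": "I'd really like a threading email reader."},
]
child = {"subject": "Re: Threading email readers",
         "body": "> I'd really like a threading email reader.\nSo would I. A threading reader would help."}
print(select_parent(combine(similarity_table(child, targets))))
```

Replacing the unweighted sum in combine with a learned weighting would correspond to the machine-learning combiner mentioned above.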

Those skilled in the art will recognize that in the
latter-described embodiment of the present invention, each of
the parent and child message filter banks may consist of a
single message filter or multiple message filters. Those
skilled in the art will further appreciate that the present
invention may be implemented in any one of a number of
known ways. For example, the present invention may be
implemented by integrating or combining the techniques of
the present invention with an e-mail reader or browser
software program. Such a program may be client-based
(i.e., found locally within an individual's personal
computer) or server-based (i.e., found in a computer or
gateway remote from the individual reader). As another
example, the present invention could be implemented as part
of a client-based or server-based message archival software
program. The advantages of the present invention do not
depend upon the particular mode of operation (i.e., server
or client) of a computer or processor through which the
techniques herein described are implemented. It will be
clear to those skilled in the art that the
messages that may be processed in accordance with the
invention described herein need not be stored in the same
location as the program utilized for carrying out such
processing. Indeed, messages may be downloaded to a client
station or to a message server from a remote location, such
as, e.g., a message database accessible over the Internet
or accessible over a corporate intranet.

In summary, instead of attempting to solve the e-mail
threading problem by forcing more consistency in the use of
structural links by client software, the present invention
involves an approach to threading that makes use of a range
of individually uncertain, but cumulatively compelling
clues as to what is going on in a conversation.

What has been described is merely illustrative of the
application of the principles of the present invention.
Other arrangements and methods can be implemented by those
skilled in the art without departing from the spirit and
scope of the present invention.
DIRNOT.TXT Fri May 2 09:32:23 1997

29-may-96 DDL : Copies of all the following files were sent to Gene
Nelson today, for the preliminary patent application.



arran $ cd ~/projects/Aactive/emailpatent/dircopies/project/

arran $ ls -lR
total 12
drwxrwxrwx   6 kknowles    4096 May 29 09:01 bin
drwxrwxrwx   2 kknowles    4096 May 29 09:56 explorations
drwxrwxrwx   2 kknowles    4096 May 28 14:09 makefiles



bin:
total 20
-rwxrwxrwx   1 lewis       1556 May 29 08:57 README2
drwxr-xr-x   2 lewis       4096 May 29 09:44 dirformat
drwxr-xr-x   2 lewis       4096 May 29 09:01 dirqueries
drwxr-xr-x   2 lewis       4096 May 29 09:01 dirscores
drwxr-xr-x   2 lewis       4096 May 29 09:01 dirutil


bin/dirformat:
total 96
-rw-rw-rw-   1 lewis       6137 May 29 08:57 README2-formatting
-rw-r--r--   1 lewis      30077 May 29 09:44 all2.ps
-rwxrwxrwx   1 kknowles     115 Jul 21  1995 get-body
-rwxrwxrwx   1 kknowles     896 Jul 24  1995 get-quoted
-rwxrwxrwx   1 kknowles     607 Aug  1  1995 get-unquoted
-rwxrwxrwx   1 kknowles     607 Aug 23  1995 get-unquoted-digest
-rwxrwx-wx   1 kknowles    1083 Aug 23  1995 headerparser
-rwxrwx-wx   1 kknowles     475 Jul 17  1995 mailbody1
-rwxrwx-wx   1 kknowles     529 Jun 28  1995 mailbody2
-rwxrwxrwx   1 kknowles     791 Jul 31  1995 make-quoted-files
-rwxrwxrwx   1 kknowles     715 Jul 31  1995 make-subj-files
-rwxrwxrwx   1 kknowles     992 Aug 23  1995 make-unquoted-digest
-rwxrwxrwx   1 kknowles     978 Aug  1  1995 make-unquoted-files
-rwxrwxrwx   1 kknowles     644 Aug 23  1995 parse-digest
-rwxrwx-wx   1 kknowles    2495 Aug 23  1995 parse-messages
-rwxrwxrwx   1 kknowles     693 Aug 22  1995 trim-digest



bin/dirqueries:
total 24
-rw-rw-rw-   1 lewis       2653 May 29 08:57 README2-runningqueries
-rwxrwxrwx   1 kknowles    1453 Aug 22  1995 query-irts
-rwxrwxrwx   1 lewis        856 May 28 13:34 query-quoted
-rwxrwxrwx   1 kknowles    1101 Aug 23  1995 query-run
-rwxrwxrwx   1 kknowles     575 Aug 23  1995 query-run-better
-rwxrwxrwx   1 kknowles     304 Aug  7  1995 runsmart


bin/dirscores:
total 64
-rw-rw-rw-   1 lewis       5353 May 29 08:57 README2-computingscores
-rwxrwxrwx   1 kknowles     802 Aug 22  1995 child-parent-rank-score
-rw-rw-rw-   1 kknowles    5210 Aug 25  1995 create_full_files
-rwxrwxrwx   1 kknowles    1915 Aug 23  1995 finish-cprs
-rwxrwx-wx   1 kknowles    1332 Jul 24  1995 get-score
-rwxrwxrwx   1 kknowles    1513 Aug 23  1995 kimjoin
-rwxrwxrwx   1 kknowles    2105 Aug 21  1995 make-score
-rwxrwx-wx   1 kknowles     218 Aug 16  1995 make-score-list
-rwxrwxrwx   1 kknowles     465 Aug 21  1995 num-docs-retrieved
-rwxrwx-wx   1 lewis        530 May 28 13:44 smartbatch1
-rwxrwxrwx   1 kknowles    2075 Jul 25  1995 test-children
-rwxrwxrwx   1 kknowles    4753 Jul 27  1995 test-irt
-rwxrwxrwx   1 kknowles    1788 Aug 16  1995 test-irt-better


bin/dirutil:
total 20
-rw-rw-rw-   1 lewis       1013 May 29 08:57 README2-utils
-rwxrwxrwx   1 kknowles     405 Aug 23  1995 forall
-rwxrwxrwx   1 kknowles     622 Aug 14  1995 geomean
-rwxrwx-wx   1 kknowles     208 Aug  4  1995 rand-selector
-rwxrwxrwx   1 kknowles     886 Jul 21  1995 stdevchildparent


explorations:
total 32
-rw-rw-rw-   1 lewis      14^58 May 28 14:06 004-notes-using-smart-to-find-parents
-rw-rw-rw-   1 lewis      13480 May 28 14:07 008-preparing-data-sets-digest-format



make_files:
total 12
-rwxrwxrwx   1 lewis       2009 May 28 14:08 make_text_quot
-rwxrwxrwx   1 lewis       1889 May 28 14:08 make_text_subj
-rwxrwxrwx   1 lewis       1991 May 28 14:08 make_text_uquot
arran $
README2 Fri May 2 09:24:19 1997

David D. Lewis and Kimberly Knowles. 29-May-96

This directory contains all email threading software used in producing
the results in:

@article{LEWIS96e
 author = "David D. Lewis and Kimberly A. Knowles"
 title = "Threading Electronic Mail: A Preliminary study"
 journal = "Submitted for publication. Comments welcome."
 year = 1996
 copyrightown = "AT&T 1996"
}

The only exception is the code of the SMART system, a publicly
available text retrieval system described in:

@book{SALTON83
 author = "Gerard Salton and Michael J. McGill"
 title = "Introduction to Modern Information Retrieval"
 publisher = "McGraw-Hill Book Company"
 year = 1983
 address = "New York"
}

The capabilities of the SMART system are very similar to those of a
number of commercial text retrieval systems, such as SearchServer from
Fulcrum, Inc. of Ottawa, Canada.

The files are broken up into 4 groups, as described in these four
files:

README2-formatting : Formatting the raw documents, including breaking
up files of messages into individual messages, and pulling out quoted,
unquoted, and subject line material.

README2-runningqueries : Using various parts of potential child
documents as queries to be matched against databases of documents

README2-computingscores : Combining the results of the multiple
queries in different ways to get rankings and scores.

README2-utils : Miscellaneous utility functions.

In this printout, the text of each of these files is followed by the
source code for that part of the software.
README2-formatting Fri May 2 09:15:53 1997

David D. Lewis and Kimberly Knowles. 29-May-96

get-body:   5/21/95 runs mailbody1 for all files in dir

mailbody1:  7/17/95 strips out only the body part of a message. called by
            smartbatch1.

mailbody2:  6/28/95 does the same for html (hypermail) formatted messages.



get-quoted: 7/17/95 mailbody1 for quoted text only. a filter which outputs only
the lines of arg which begin with any number of types of prefixes, and crops
that part. called by query-quoted to run smart on quoted material. called
by make-quoted-files to make dir of files of quoted material.

# Let through just the quoted body part of a mail message
# in mbox format. (Only one message in file.)
# called on by query-quoted
#
# What i really want is to be able to denote a list of possible prefixes somewhere
# and just say if any of these, then strip them and print. otherwise, skip it over
# how about if (/pattern1/ || /pattern2/ || ... || /patternN/) { s/\1//; print;} ?



get-unquoted: 6/1/95 does the same as get-quoted but for unquoted (nonquoted)
text only. called by make-unquoted-files.

# Let through just unquoted part of the body of a mail message
# in mbox format. (Only one message in file.)

get-unquoted-digest: 8/23/95 does the same as get-unquoted but uses "Date: "
as a message delimiter, instead of "From "

headerparser: 6/2?/95 removes listserv to kim headers from archive files
which were mailed in digest format. removes all pinhead headers

# Usage:
# headerparser FILENAME DESTINATION
#
# This is to parse messages of archives that i receive from a
# listserver. the header from the server to me is what i want
# to get rid of. the archives, in digest format, can be dealt
# with separately or by expanding the program eventually.
# Note we know the first line is a "From " line.
# Note also that the initial header is followed by a blank line.
# So if we look for a blank line after the "From " line, we can
# delete up to there and continue. In fact, since the rest is
# in digest format, we know that "From " must signal the header.



make-quoted-files: 7/31/95 makes a directory of files. calls on get-quoted to
filter for quoted material only.

# Usage: make-quoted-files DIR DEST
# where DIR is the source directory and DEST is the directory where you want the files
# put
#
# this will create files of just quoted material contents
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why it's happening.



make-subj-files: 7/29/95 prints subject line of all files in dir to outdir. one
file per subject. a direct mapping of message to subject file.
# Usage: make-subj-files DIR DEST
#
# this will create files of just subject contents, with Re: removed if
# it is there.
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why it's happening.

make-unquoted-digest-files: 8/23/95 same as make-unquoted-files, only it calls
on get-unquoted-digest.

# Usage: make-unquoted-digest-files DIR DEST
# where DIR is the source directory and DEST is the directory where you want the files
# put.
#
# this will create files of just unquoted material contents.
#
# this was copied directly from make-quoted-files, with a 3-character change. i guess
# if i were doing this a lot more times, i'd just take the filter script as a
# parameter.
#
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why it's happening.
make-unquoted-files: 7/31/95 prints unquoted text of files to outdir. one file per
subject.

# Usage: make-unquoted-files DIR DEST
# where DIR is the source directory and DEST is the directory where you want the files
# put.
#
# this will create files of just unquoted material contents.
#
# this was copied directly from make-quoted-files, with a 3-character change. i guess
# if i were doing this a lot more times, i'd just take the filter script as a
# parameter.
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why it's happening.



parse-digest: 8/22/95 parses messages in digest format, using "Date: "
delimiter

# Usage: parse-digest infile dir
#
# This is a perl script meant to parse messages in digest format into
# separate files, one message per file.
#
# This will work similarly to parse-messages.
# Messages are begun with the digest header "Date: "



parse-messages: 7/9/95 uses pinhead to cut messages into indiv files called
000000.mbox. takes archive file as argument. destination file and number to
append to are hardwired. well not any more.
# Usage: parse-messages OUTDIR LIST OF ARCHIVE FILES
# This is a perl script meant to parse messages in .mbox format into
# separate files, one message per file.
#
# This will work similarly to
# /home/lewis/projects/Aactive/email/bin/dlbreak
# but I don't think it will have to sort by mailing list, since the
# archives are already in the same file by mailing list.
# Messages are begun with the pinhead header "From <stuff> <Date>" or
# some approximation.



trim-digest: 8/22/95 designed to take the separator string out of parsed message
from digest archives.

# usage: trim-digest messagedir
# to trim out the last line of digest files parsed using parse-digest
# i should just encapsulate them in the future.
get-body Fri May 2 09:15:53 1997

#!/bin/ksh
#

prefix=$1

for file in ${prefix}/*
do
    /home/kknowles/project/bin/mailbody1 $file
done

exit 0
get-quoted Fri May 2 09:15:53 1997

#!/usr/sbin/perl

# Let through just the quoted body part of a mail message
# in mbox format. (Only one message in file.)

# called on by query-quoted

# What i really want is to be able to denote a list of possible prefixes somewhere,
# and just say if any of these, then strip them and print. otherwise, skip it over.
# how about if (/pattern1/ || /pattern2/ || ... || /patternN/) { s/\1//; print;} ?

#$pat1 = "([A-Za-z]*>)";
$pat2 = "([A-Za-z]*>+)";
$pat3 = "(\\|)";
$pat4 = "([A-Za-z]*\\|+)";
#$pat = "$pat1|$pat2|$pat3|$pat4";
$pat = "$pat2|$pat3|$pat4";

while (<>) {
    if (/^((\s*)($pat)($pat|(\s))*)/) {
        # strip the matched quote prefix (the listing's s/\1// has no group to refer to)
        $_ = substr($_, length($1));
        print;
    }
    # a second branch, meant to filter the quotations set off by indents,
    # doesn't work: message 16 has 3 spaces indentations, plus later on it has
    # code indented (that's not a quote). sigh.
}
#!/usr/sbin/perl
#
# Let through just unquoted part of the body of a mail message
# in mbox format. (Only one message in file.)
#

$pat2 = "([A-Za-z]*>+)";
$pat3 = "(\\|)";
$pat4 = "([A-Za-z]*\\|+)";
$pat = "$pat2|$pat3|$pat4";

$state=0;
while (<>) {
    if ($state == 0) {
        if (/^From /)
            {$state=1;}
        elsif (/\S/)
            {die("Non-blank in line before From . in line $.");}
    }
    elsif ($state == 1) {
        if (/^\s*$/)
            {$state=2;}
    }
    elsif ($state == 2) {
        if (!/^.$/)
            {print unless (/^((\s*)($pat)($pat|(\s))*)/);}
    }
    else
        {die("Should not get here.");}
}
#!/usr/sbin/perl
#
# Let through just unquoted part of the body of a mail message
# in mbox format. (Only one message in file.)
#

$pat2 = "([A-Za-z]*>+)";
$pat3 = "(\\|)";
$pat4 = "([A-Za-z]*\\|+)";
$pat = "$pat2|$pat3|$pat4";

$state=0;
while (<>) {
    if ($state == 0) {
        if (/^Date: /)
            {$state=1;}
        elsif (/\S/)
            {die("Non-blank in line before Date: in line $.");}
    }
    elsif ($state == 1) {
        if (/^\s*$/)
            {$state=2;}
    }
    elsif ($state == 2) {
        if (!/^.$/)
            {print unless (/^((\s*)($pat)($pat|(\s))*)/);}
    }
    else
        {die("Should not get here.");}
}

__END__
#!/usr/sbin/perl
#
# Usage:
# headerparser FILENAME DESTINATION
#
# This is to parse messages of archives that i receive from a
# listserver. the header from the server to me is what i want
# to get rid of. the archives, in digest format, can be dealt
# with separately or by expanding the program eventually.
# Note we know the first line is a 'From ' line.
# Note also that the initial header is followed by a blank line
# So if we look for a blank line, after the 'From ' line, we can
# delete up to there and continue. In fact, since the rest is
# in digest format, we know that 'From ' must signal the header

$archive=$ARGV[0];
$parsedarchive=$ARGV[1];

open(INFILE, $archive) || die "Couldn't open infile";
open(OUTFILE, ">$parsedarchive") || die "Couldn't open outfile";
$state=0;

while (<INFILE>) {
    if ($state == 0) {
        if (/^From /) {
            $state=1;
        }
        if ($state == 0) {
            print OUTFILE;
        }
    }
    elsif ($state == 1) {
        if (/^\s$/) {
            $state=0;
        }
    }
    else {
        die("Shouldn't get here");
    }
}
mailbody1 Fri May 2 09:15:53 1997

#!/usr/sbin/perl
#
# Let through just the body part of a mail message
# in mbox format. (Only one message in file.)
# called on by smartbatch1.
#

$state=0;
while (<>) {
    if ($state == 0) {
        if (/^From /)
            {$state=1;}
        elsif (/\S/)
            {die("Non-blank in line before From .");}
    }
    elsif ($state == 1) {
        if (/^\s*$/)
            {$state=2;}
    }
    elsif ($state == 2) {
        if (!/^.$/)
            {print;}
    }
    else
        {die("Should not get here.");}
}

__END__
#!/usr/sbin/perl
#
# Let through just the body part of a mail message
# in html format. (Only one message in file.)
#

$state=0;
while (<>) {
    if ($state == 0) {
        if (/^<\!-- body="start" -->/)
            {$state=1;}
    }
    elsif ($state == 1) {
        if (!/<p>\s*/) {
            if (/^<\!-- body="end" -->/) {
                $state=0;}
            else {
                chop;
                chop;
                chop;
                chop;
                chop;
                print "$_\n";}}
    }
#   elsif ($state == 2) {
#       print;
#       $state=1;
#   }
    else
        {die("Should not get here.");}
}
#!/usr/sbin/perl
#
# Usage: make-quoted-files DIR DEST
# where DIR is the source directory and DEST is the directory where you want the files
# put.
#
# this will create files of just quoted material contents.
#
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why it's happening.

#$/="";
#$*=1;

opendir(INDIR, $ARGV[0]);
@infile=readdir(INDIR);
closedir(INDIR);

$quotdir=$ARGV[1];

mkdir($quotdir, 0755);

foreach (@infile) {
    open(MSGFILE, $ARGV[0] . $_);
    /(0\d*).mbox/;
    $flname=$1;

    open(OUTFILE, ">$quotdir$flname.quot");
    open(QUOTE, "get-quoted $ARGV[0]$_ |");
    while (<QUOTE>) {
        print OUTFILE $_;
    }
    close(QUOTE);
    close(OUTFILE);
    close(MSGFILE);
}
make-subj-files Fri May 2 09:15:53 1997

#!/usr/sbin/perl
#
# Usage: make-subj-files DIR DEST
# this will create files of just subject contents, with Re: removed if
# it is there.
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why it's happening.

$/="";
$*=1;

opendir(INDIR, $ARGV[0]);
@infile=readdir(INDIR);
closedir(INDIR);

$subjdir=$ARGV[1];

mkdir($subjdir, 0755);

foreach (@infile) {
    open(MSGFILE, $ARGV[0] . $_);
    /(0\d+).mbox/;
    $flname=$1;

    open(OUTFILE, ">$subjdir$flname.subj");
    while (<MSGFILE>) {
        if (/^Subject: (Re: )?(.*)$/) {
            $subj=$2;
            print OUTFILE "$2\n";
        }
    }
    close(OUTFILE);
    close(MSGFILE);
}
make-unquoted-digest-files Fri May 2 09:15:53 1997

#!/usr/sbin/perl
#
# Usage: make-unquoted-digest-files DIR DEST
# where DIR is the source directory and DEST is the directory where you want the files
# put.
#
# this will create files of just unquoted material contents.
#
# this was copied directly from make-quoted-files, with a 3-character change. i guess
# if i were doing this a lot more times, i'd just take the filter script as a
# parameter.
#
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why its happening.

#$/="";
#$*=1;

opendir(INDIR, $ARGV[0]);
@infile=readdir(INDIR);
closedir(INDIR);

$quotdir=$ARGV[1];

mkdir($quotdir, 0755);

foreach (@infile) {
    open(MSGFILE, $ARGV[0] . $_);
    /(0\d+).mbox/;
    $flname=$1;

    open(OUTFILE, ">$quotdir$flname.uquot");
    open(QUOTE, "get-unquoted-digest $ARGV[0]$_ |");
    while (<QUOTE>) {
        print OUTFILE $_;
    }
    close(QUOTE);
    close(OUTFILE);
    close(MSGFILE);
}
#!/usr/sbin/perl
#
# Usage: make-unquoted-files DIR DEST
# where DIR is the source directory and DEST is the directory where you want the files
# put.
#
# this will create files of just unquoted material contents.
#
# this was copied directly from make-quoted-files, with a 3-character change. i guess
# if i were doing this a lot more times, i'd just take the filter script as a
# parameter.
#
# for some reason, it is creating an empty dotfile. this will screw up the message id's
# for smart, so be sure to remove it. i'm not really sure why its happening.

#$*=1;

opendir(INDIR, $ARGV[0]);
@infile=readdir(INDIR);
closedir(INDIR);

$quotdir=$ARGV[1];

mkdir($quotdir, 0755);

foreach (@infile) {
    open(MSGFILE, $ARGV[0] . $_);
    /(0\d*).mbox/;
    $flname=$1;

    open(OUTFILE, ">$quotdir$flname.uquot");
    open(QUOTE, "get-unquoted $ARGV[0]$_ |");
    while (<QUOTE>) {
        print OUTFILE $_;
    }
    close(QUOTE);
    close(OUTFILE);
    close(MSGFILE);
}

#!/usr/sbin/perl
#
# Usage: parse-digest infile dir
#
# This is a perl script meant to parse messages in digest format into
# separate files, one message per file.
#
# This will work similarly to parse-messages.
#
# Messages are begun with the digest header "Date:"
#
# We will be using strings with multiple lines.
$* = 1;

# messages that you have. the first one written to will be 1+ this number.
$num = "00000";

$indir = $ARGV[0];
$outdir = $ARGV[1];

open(INFILE, $indir);

while (<INFILE>) {
    if (/^Date:/) {
        open(OUT_ARCHIVE, ">$outdir/" . ++$num . ".mbox") || die "couldn't open file";
    }
    print(OUT_ARCHIVE);
}
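
The splitting rule used by parse-digest can be sketched in Python as follows. This is a hedged illustration rather than the original program: the function name `split_digest` is hypothetical, the zero-padded `NNNNN.mbox` naming is assumed from the `"00000"` counter in the listing, and the sketch returns the pieces in memory instead of writing files.

```python
def split_digest(text, start=0):
    """Split digest text into messages, one per 'Date:' header line.

    Returns {filename: message text}. Lines before the first 'Date:'
    line are dropped, since no output file is open yet at that point.
    """
    files = {}
    name = None
    num = start
    for line in text.splitlines(keepends=True):
        if line.startswith("Date:"):
            num += 1  # the first file is numbered start + 1
            name = f"{num:05d}.mbox"
            files[name] = ""
        if name is not None:
            files[name] += line
    return files
```

A digest containing two `Date:` lines therefore produces `00001.mbox` and `00002.mbox`.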

#!/usr/sbin/perl
#
# usage: parse-messages outdir list of ARCHIVE FILES
#
# This is a perl script meant to parse messages in mailbox format into
# separate files, one message per file.
#
# This will work similarly to

• /heaw/l«wka/pro?«««.a/A«cctv*/«MLl/blA/dlbrMlil 

• but I don't tniok ic will h*v* to ion by Mtling U«t. alnc* Co* 
t archlvaj* *« *ir**4y in en* a w Cll* by Mi lino Hat. 

• 

4 Waaapoa *r* b*gun wicft ch* pinb**d h**d*r 'Pro* «*tuc.f» «0at*»* or 

• a nan «pnraxi**tlon . 
• 

» w* will b* using itrin«i with auilcipl* llnaa. 

•i/uar/abin/p*rl 

* 

♦mimgat Ch*t you tvav* . ch* first on* wxtctan to wlII b* I « tltt* nuWar . 
IruafOOOOO* . 

I loutdl r- *c radish /0T0/ pro jtTaln/»*w**»*«* : 

t/ra41sh/070/n*tlist*'archtv**/www«*lk/wwMVTalk_199l .archw* 

• /r*dl«h/070/n«tliata/*rcniv*«/wwwwlk/M**_Talk_1992 .archive 
»/r«di«h/0Y0/i«aeliata/«renLv«s/wwwc*lk'iN««_Talk_l»9]qi.*rcfalv* 
*'r«4ish/070/n*cLl*c«/*rchiv**/www«*lk/M»M_T*lk_lMJQ / 2 .archiv* 
«/ radian/ 070 snacl lata s*rch.iv**'wwwt*lii/ www-calk- 9403 

• /r*di*h/070/n*tli«t*/*rchiv*«Jwwwc*lk/www-t«lk-»40) 

• /r*4i«h/070/n*t list*/ archiv** /wwwra>lk/www-t«lk-9404 
t / r«di shy 070 / nac 1 1 a t a / arcnt v«« y wwwC4 1 k > www- «.* 1 k • 9 4 0 ) 

• / r*di alt/ 07 0 / n*tl i at*/ archi v** / www*, a 1 k/ www- is Ik - 94 0 « 
•/radisft/070/n«xi tacay *rc«*v**/wwwtaik/ www- talk-9407 

*««rcniv*a* t •/ra4ian/070/n*cllsta/*reRlv*«'www«alk/www-t*.lk-940a • . •/r*dlah/Q70/n*tlL*ta/arctuv«aJwwwt.*lk/www-ialk-94a3- * / radish' 010/ n 
*cll*ts/*rcnlv**'www«».lk/www-c*lk-9404*. wr*dish/070/n*til«ts/*rcftlv«*/wwt*ik/www-<*lh-940V . ■ / radlah/0?0/n*ti uti/«rcnivn/v«wt«lk/w 
ww-ulk-9404* . * /radios/ 070/ n«cl la t«/arcklv««'wwwulk/ www »t*Uk-> 9407 M j 

s«*rcblv*«« i - / radish / O70 / n*tl 1 a t* ' a rchivwa/wwwtalk/ www- talk- 9401* . * /radiah/070/o«t.liscssa.r<:hiv*s/wwwialic/ww talk- 9409* . * / radl ah/070/n 
*cllscs'*rchlw«a/wwwt«lk/www-t*lk-94X0*. */radi«h/070/n*tiia-tas«rcniv**/wwwT.aL«s*ww->calk-94 I l • . ~/radiah/07a/n*cli*te'*rcrtiv*a'wwwt*lk/w 
ww-t*lh-»413' . -/rad&aA/07O/n*tll«ea/*rekiw*«/www«*lk/www>-Mlk-9S01*i ; 

# this was just to test:
#@archives = ("/radish/070/netlists/archives/wwwtalk/www-talk-9405");

# i think this will work.
$outdir = shift(@ARGV);
@archives = @ARGV;

toe«n (1HTXLC. •/ radtsh/OTQ /rv*tli«t«/*rcMv«*j aaartpwopl*/ aBMrtp*opl*949) l . 
•op*n < IMTtlX. */r«.diatw070/n*tl&acar*rcDlv*a'wwwt*|k/vww-c.«lk-94iO*1 : 

#while (<INFILE>) {

foreach (@archives) {
    open(INFILE, $_);
    while (<INFILE>) {
        if (/^From /) {
            open(OUT_ARCHIVE, ">$outdir/" . ++$num . ".mbox") || die "couldn't open file";
        }
        print(OUT_ARCHIVE);
    }
}

trim-digest    Fri May 2 09:15:53 1997

#!/usr/sbin/perl
#
# usage: trim-digest messagedir
#
# to trim out the last line of digest files parsed using parse-digest.
# i should just encapsulate them in the future.

$thedir = $ARGV[0];

opendir(INDIR, $thedir) || die "Couldn't open indir";
@messages = readdir(INDIR);
closedir(INDIR);
shift(@messages);
shift(@messages);

foreach (@messages) {
    `mv $thedir/$_ tmp`;
    open(MSG, "tmp") || die "Couldn't open infile: $!";
    open(OUT, ">$thedir/$_") || die "Couldn't open outfile";
    $b = <MSG>;
    $pline = $b;
    while ($a = <MSG>) {
        print OUT $pline;
        $pline = $a;
    }
    if ($pline =~ /^-+$/) {
    }
    else {
        print OUT $pline;
    }
    close(MSG);
    close(OUT);
}

`rm tmp`;
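
The trimming rule in trim-digest (drop the final line only when it consists entirely of dashes) can be sketched in Python as below. The function name `trim_last_line` is hypothetical, and the sketch works on a list of lines in memory rather than on files in a directory.

```python
import re


def trim_last_line(lines):
    """Return the lines without the last one if it is a dashes-only separator."""
    if lines and re.fullmatch(r"-+\n?", lines[-1]):
        return lines[:-1]
    return list(lines)
```

A message whose last line is ordinary text is returned unchanged.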

David D. Lewis and Kimberly Knowles, 29-May-96

query-irts: 3/7/95 takes a filename of index, directory of files, specfile, and outfile as
arguments, runs smart using each file in the dir as a query, and prints it to
outfile. num_wanted is a variable in the script, and can be changed appropriately.

#!/usr/sbin/perl
#
# Usage: query-irts keydir dirquery spec outfile
#
# first, make an array of the childids we are interested in, taken from file
# keydir (first column)
# then, run smart for each of them,
# store it in outfile.
# we should input a directory for the query files, and the spec file.
# $ARGV[0] = file of child-parent pairs by message number
# $ARGV[1] = directory of single file queries, string ending with "/"; must contain
#            /subjects/, /quotes/, or /unquotes/
# $ARGV[2] = location of corresponding spec file
# $ARGV[3] = where to put the output

NOTE: YOU SPECIFY TYPE OF QUERY BY GIVING IT A DIRECTORY


query-quoted: 7/17/95 smartbatch1 for only running quoted material as a query,
calls on get-quoted.

# query-quoted PREFIX SPEC NUMOUT > OUT
#
# Description: This program is a demonstration of operating
# smart in batch mode. Iterate over several documents (one
# per file), run them as smart queries, and save the
# results for the top 10 documents.
#
# history: adapted from smartbatch1

NOTE: HARDCODED TO USE QUOTED MATERIAL AS QUERIES


query-run: 8/1/95 like smartbatch, only it doesn't filter the document as it
comes in; it assumes the document that comes in is the one we want.

# Usage: query-run PREFIX SPEC NUMOUT > OUT
#
# Description: This program is a demonstration of operating
# smart in batch mode. Iterate over several documents (one
# per file), run them as smart queries, and save the
# results for the top 10 documents.
#
# This version runs on each file, in its entirety. therefore, it is meant for use when the
# files are preparsed, and the entire file is used as a query. This removes the need to
# filter out the query by piping it from another script.


query-run-better: 8/23/95 same as query-run, only it takes more as inputs and
doesn't require a pipe to output.

# Usage: query-run-better indir spec numout outfile
#
# similar to query-run; runs smart on each doc in indir across docs
# in spec.

$indir = $ARGV[0];
$spec = $ARGV[1];
$num_wanted = $ARGV[2];
$outfile = $ARGV[3];


runsmart: 3/7/95 runs a file in smart, given file, spec, numout

# runsmart FILE SPEC NUMOUT

query-irts    Fri May 2 09:17:12 1997

#!/usr/sbin/perl
#
# Usage: query-irts keyfile dirquery spec outfile
#
# first, make an array of the childids we are interested in, taken from file
# keydir (first column)
# then, run smart for each of them,
# store it in outfile.
# we should input a directory for the query files, and the spec file.
# $ARGV[0] = file of child-parent pairs by message number
# $ARGV[1] = directory of single file queries, string ending with "/"; must contain
#            /subjects/, /quotes/, or /unquotes/
# $ARGV[2] = location of corresponding spec file
# $ARGV[3] = where to put the output
$keydir = $ARGV[0];
$querydir = $ARGV[1];
$spec = $ARGV[2];
$outfile = $ARGV[3];

# first we build the array of the messageids we want to use as queries
#open(INDEX, "/radish/070/projtrain/child-parent-pairs-irt-id");
#open(INDEX, "/radish/070/projtest10k/child-parent-pairs-irt-id");
#open(INDEX, "child-parent-pairs-irt-id");
open(INDEX, $keydir);
while (<INDEX>) {
    $idnum = substr($_, 0, 5);
    push(@childids, $idnum);
}

if ($querydir =~ /subjects/) {
    $suffix = ".subj";
}
elsif ($querydir =~ /unquotes/) {
    $suffix = ".uquot";
}
elsif ($querydir =~ /quotes/) {
    $suffix = ".quot";
}

$num_wanted = 3000;

foreach (@childids) {
    $queryfile = "$querydir$_$suffix";
    open(SMART, "runsmart $queryfile $spec $num_wanted |");
    open(OUTDIR, ">>$outfile");
    while (<SMART>) {
        if (/No documents to display/) {
            print OUTDIR "$_\n";
        }
        else {
            print OUTDIR;
        }
    }
    close(OUTDIR);
}
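
The way query-irts derives query file names from the key file and the query directory can be sketched in Python as below. This is an illustration under stated assumptions: the names `suffix_for` and `query_files` are hypothetical, and the five-character message number is assumed from the `0nnnn` file naming used elsewhere in these listings (the length argument is illegible in the printed listing). Note that "unquotes" must be tested before "quotes", because the former contains the latter.

```python
def suffix_for(querydir):
    """Pick the query file suffix from the directory name."""
    if "subjects" in querydir:
        return ".subj"
    if "unquotes" in querydir:  # must come before the "quotes" test
        return ".uquot"
    if "quotes" in querydir:
        return ".quot"
    raise ValueError("directory must contain subjects, quotes, or unquotes")


def query_files(key_lines, querydir, id_width=5):
    """Build one query file path per child id (first column of the key file)."""
    suffix = suffix_for(querydir)
    return [querydir + line[:id_width] + suffix for line in key_lines]
```

Each resulting path would then be passed to `runsmart` along with the spec file and the number of documents wanted.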

#!/bin/ksh
#
# query-quoted PREFIX SPEC NUMOUT > OUT
#
# Description:
# Iterate over several documents (one
# per file), run them as smart queries, and save the
# results for the top 10 documents.
#
# history: adapted from smartbatch1
#
# set -x

prefix=$1   # directory of files which are indiv. messages
spec=$2     # location of spec file
numout=$3   # number of records per query

for file in ${prefix}*
do
    rm /tmp/tmp.query-quoted*
    echo $file
    cat $file | get-quoted > /tmp/tmp.query-quoted-data
    s=$(wc -c /tmp/tmp.query-quoted-data)
    s1=$(echo $s | nawk '{print $1}')
    if ((s1==0))
    then print "No quoted material"
    else
        cat ~/project/queryprefix /tmp/tmp.query-quoted-data /radish/070/tmp/querysuffix > /tmp/tmp.query-quoted-query
        cat /tmp/tmp.query-quoted-query | smart inter $spec verbose 0 num_wanted $numout
    fi
done
exit 0

#!/bin/ksh
#
# Usage: query-run PREFIX SPEC NUMOUT > OUT
#
# Description: This program is a demonstration of operating
# smart in batch mode. Iterate over several documents (one
# per file), run them as smart queries, and save the
# results for the top 10 documents.
#
# This version runs on each file, in its entirety. therefore, it is meant for use when the
# files are preparsed, and the entire file is used as a query. This removes the need to
# filter out the query by piping it from another script.
#
# set -x

prefix=$1   # directory of files which are indiv. messages
spec=$2     # location of spec file
numout=$3   # number of records per query

for file in ${prefix}*
do
    rm /tmp/tmp.query-run*
    echo $file
    cat $file > /tmp/tmp.query-run-data
    s=$(wc -c /tmp/tmp.query-run-data)
    s1=$(echo $s | nawk '{print $1}')
    if ((s1==0))
    then print "No quoted material"
    else
        cat /radish/070/tmp/queryprefix /tmp/tmp.query-run-data /radish/070/tmp/querysuffix > /tmp/tmp.query-run-query
        cat /tmp/tmp.query-run-query | smart inter $spec verbose 0 num_wanted $numout
    fi
done
exit 0

#!/usr/sbin/perl
#
# Usage: query-run-better indir spec numout outfile
#
# similar to query-run; runs smart on each doc in indir across docs
# in spec.

$indir = $ARGV[0];
$spec = $ARGV[1];
$num_wanted = $ARGV[2];
$outfile = $ARGV[3];

opendir(INDIR, $indir);
@files = readdir(INDIR);
closedir(INDIR);
shift(@files);
shift(@files);

foreach (@files) {
    open(SMART, "runsmart $indir$_ $spec $num_wanted |");
    open(OUTDIR, ">>$outfile");
    while (<SMART>) {
        if (/No documents to display/) {
            print OUTDIR "$_\n";
        }
        else {
            print OUTDIR;
        }
    }
    close(OUTDIR);
}

#!/bin/ksh
#
# runsmart FILE SPEC NUMOUT
#
# Description: This program is a demonstration of operating
# smart in batch mode. Iterate over several documents (one
# per file), run them as smart queries, and save the
# results for the top 10 documents
#
# set -x

file=$1     # directory of files which are indiv. messages
spec=$2     # location of spec file
numout=$3   # number of records per query

rm /tmp/tmp.runsmart*
echo $file

cat $file > /tmp/tmp.runsmart-data
s=$(wc -c /tmp/tmp.runsmart-data)
s1=$(echo $s | nawk '{print $1}')
if ((s1==0))
then print "No quoted material"
else
    cat /radish/070/tmp/queryprefix /tmp/tmp.runsmart-data /radish/070/tmp/querysuffix > /tmp/tmp.runsmart-query
    cat /tmp/tmp.runsmart-query | smart inter $spec verbose 0 num_wanted $numout
fi

exit 0

David D. Lewis and Kimberly Knowles, 29-May-96

child-parent-rank-score: 8/22/95 scans through full-query-* files and prints
out the child:parent <rank> <score> for each combination of found retrieved
parent for the child

# usage: child-parent-rank-score INFILE OUTFILE
# Scans through full-query-* files and prints out the
#   child.parent <rank> <score>
# for each retrieved parent for the child in OUTFILE.
#
# Assumes the lines in INFILE contain filenames of form '0nnnn.' followed
# by suffixes quot, unquot, subj, or mbox.


create_full_files: 8/25/95 a full "make" sort of file template, meant to
operate from all stages of processing data. it is meant for copying to the
appropriate place and customizing location and style of data (mbox format
versus digest, what sort of lists of children and parents are used, how
much has to be processed, etc.)

# usage: create_full_files 2> cff-log
# shell program to run lots of stuff overnight:
# assumes a directory called messages/ in current dir, which contains
# indiv. messages that comprise the dataset
# some global parameters
homedir="/radish/070/projtestsibs"
dbase_name="projtestsibs"
indexfilename="child-sibling-pairs-irt-id"
message_source="/radish/070/projtest10k/messages/"
# if this is a source for sibling lists
pairs_list="/radish/070/projtest10k/child-parent-pairs-irt-id"
#be sure to check the make_unquoted_(digest-)files line.

NOTE: THIS IS TOP LEVEL DO THE WHOLE THING


finish-cprs: 3/11/95 to replace 'nr' in complete cprs file with appropriate
max numbers from num-retrieved-* files.

# usage: finish-cprs FILE OUTFILE
#
# to run on an 11-column file of ranks, such as cprs-*. will replace nr with
# 0.00 in the score column and the lowest rank from files that are hardwired:
@files = ("num-retrieved-quotq-quot", "num-retrieved-quotq-unquot",
          "num-retrieved-unquotq-quot", "num-retrieved-unquotq-unquot",
          "num-retrieved-subj");

NOTE: ASSIGNS RANK TO NONRETRIEVED ITEMS


get-score: 7/18/95 takes a dataset file as arg, finds child (first) message
number, and parent (second) message number, opens a record of smart runs
using sub find_score, which finds the rank of 2nd when using first as a query,
based on data in a hardwired file, and prints results in 3 fields.
called on by make-score-list.

# usage: get-score FILE
#
# this is called on by script make-score-list, to get the message numbers of
# each child and parent, and the corresponding rank of the parent.


kimjoin: 8/11/95 joins two cprs-* files together, and does the right thing.

# usage: kimjoin file1 file2 outfile
#
# takes two files, sorted numerically, of data of form
#   <childid>.<parentid>\t(<rank>\t<score>)x N
# and joins it the right way, with
#   <childid>.<parentid>\t(<rank>\t<score>)x N1 \t(<rank>\t<score>)x N2
# replacing the right number of fields with nr when there was no data from
# the file.


make-score: 8/4 adapted from make-score-list and get-score, takes a drun file
and creates a score file.

# usage: make-score KEYFILE DRUN_FILE OUTFILE
#
# takes keyfile index of child-parent message numbers, file of output from running
# smart, and outfile. prints out childnum, parentnum, rank the parent came back at
# in outfile. if parent not returned, says 'not in top $n', where $n is the number
# of documents returned.
# taken from ~/project/bin/make-score-list and get-score, adapted for better
# I/O stuff.


make-score-list: 7/18/95 prints labels in 3 fields, runs get-score for all
files in dir.

# make-score-list DIRECTORY > OUTFILE


num-docs-retrieved: 8/10/95 counts up number of docs retrieved from full-query-* file
and prints them <child> <num_retrieved> in file.

# usage: num-docs-retrieved INFILE OUTFILE
# goes through a full-query-* file and prints to outfile the number of
# documents retrieved for each child.


smartbatch1: 7/11/95 cats smart commands and results of mailbody1, and runs
smart on each file in dir.


test-children: 7/25/95 like get-score and make-score-list, uses sub amichild to look
for two indicators that a message is a child, and runs it for every file in a prefix.

# Usage: test-children PREFIX OUTFILE NUMOUT
#
# model after make-score-list and get-score, except we want it all in perl, so
# we can recognize the filenames as well.
#
# okay, first we want to grab everything in a file. i think we can pipe in
# filenames from shell command find to open(FILEHANDLE, "find <blah> |") or
# something.
# then we want to open each one and scan to see one of those things.
# then we want to print out what the final status was. childid and T or F
#
# we can use return $variable for passing values to and from subroutines.


test-irt-better: 8/4/95 does the same as test-irt, only in 2n instead of n^2.

# Usage: test-irt-better PREFIX OUTIDS
#
# meant to create a file for outputting the childid's and ppids of
# messages which have message-id's in the in-reply-to header.

#!/usr/sbin/perl
#
# usage: child-parent-rank-score INFILE OUTFILE
# Scans through full-query-* files and prints out the
#   child.parent <rank> <score>
# for each retrieved parent for the child in OUTFILE.
#
# Assumes the lines in INFILE contain filenames of form '0nnnn.' followed
# by suffixes quot, unquot, subj, or mbox.

$infile = $ARGV[0];
$outfile = $ARGV[1];
open(FULL_QUERY, $infile);
open(OUTFILE, ">$outfile");
DATA:
while (<FULL_QUERY>) {
    if (/\/0+(\d*)\.(quot|unquot|subj|mbox)/) {
        $child = $1;
    }
    elsif (/^Num/) {
        $rank = 0;
    }
    elsif (/^No [qd]/) {
    }
    else {
        $rank++;
        ($parent, $score, $title) = /^\s+(\d*)\s+([01]\.\d*)\s(.*)/;
        print OUTFILE "$child.$parent\t$rank\t$score\n";
#       push(@data, "$child:$parent:$rank:$score");
    }
}
close(OUTFILE);
close(FULL_QUERY);
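
The scan performed by child-parent-rank-score can be sketched in Python as below. This is a sketch under assumptions, not the original program: the function name `rank_rows` is hypothetical, and the input layout (a query file name line, a header line starting with "Num", then one line per retrieved document with an id and a score) is inferred from the patterns in the listing. Unlike the listing, the sketch increments the rank only on lines that actually parse as results.

```python
import re

QUERY_RE = re.compile(r"/0+(\d*)\.(quot|unquot|subj|mbox)")
RESULT_RE = re.compile(r"^\s+(\d+)\s+([01]\.\d*)\s")


def rank_rows(lines):
    """Return (child.parent, rank, score) for every retrieved document."""
    rows = []
    child = None
    rank = 0
    for line in lines:
        query = QUERY_RE.search(line)
        if query:
            child = query.group(1)  # a new query (child message) starts
        elif line.startswith("Num"):
            rank = 0  # header line of a result list
        elif re.match(r"No [qd]", line):
            continue  # "No quoted material" / "No documents to display"
        else:
            result = RESULT_RE.match(line)
            if result:
                rank += 1
                rows.append((f"{child}.{result.group(1)}", rank, result.group(2)))
    return rows
```

Each returned tuple corresponds to one output line of the form `child.parent<TAB>rank<TAB>score`.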

#!/bin/ksh
#
# usage: create_full_files 2> cff-log
# shell program to run lots of stuff overnight:
#
# assumes a directory called messages/ in current dir, which contains
# indiv. messages that comprise the dataset

# some global parameters
homedir="/radish/070/projtestsibs"
dbase_name="projtestsibs"
indexfilename="child-sibling-pairs-irt-id"
message_source="/radish/070/projtest10k/messages/"
# if this is a source for sibling lists
pairs_list="/radish/070/projtest10k/child-parent-pairs-irt-id"
#be sure to check the make_unquoted_(digest-)files line.

mkdir ${homedir}
mkdir ${homedir}/messages
cd ${message_source}
forall "cp REPL ${homedir}/messages/."
cd ${homedir}



# make an index of the answers
# test-irt-better messages/ child-parent-pairs-irt-id
# or, for creating the sibling list:
cp ${pairs_list} ${homedir}/child-parent-pairs-irt-id
sort -n +1 child-parent-pairs-irt-id > sorted-cp-pairs
get-sibs sorted-cp-pairs cs-pairs
sort -n cs-pairs > child-sibling-pairs-irt-id

# generate the data files on extracted text
make-quoted-files messages/ quotes/
rm quotes/.quot

#make-unquoted-files messages/ unquotes/
make-unquoted-digest-files messages/ unquotes/
rm unquotes/.uquot

make-subj-files messages/ subjects/
rm subjects/.subj

date
print -u2 "extracted files made\n"

# build the smart databases
~/project/make_files/make_text_quot quotes/ /radish/070/kim-email-dbases/${dbase_name}quot
~/project/make_files/make_text_uquot unquotes/ /radish/070/kim-email-dbases/${dbase_name}unquot
~/project/make_files/make_text_subj subjects/ /radish/070/kim-email-dbases/${dbase_name}subj

date
print -u2 "smart databases made\n"

# run queries, generate data
cd ${homedir}

query-irts $indexfilename quotes/ /radish/070/kim-email-dbases/${dbase_name}quot/spec full-query-quotq-quot
date
print -u2 "full query qq made\n"

make-score $indexfilename full-query-quotq-quot full-parent-rank-quotq-quot
recall-precision-table full-parent-rank-quotq-quot rp-table-quotq-quot
child-parent-rank-score full-query-quotq-quot cprs-quotq-quot
num-docs-retrieved full-query-quotq-quot num-retrieved-quotq-quot
date
print -u2 "qq files completed\n"

query-irts $indexfilename quotes/ /radish/070/kim-email-dbases/${dbase_name}unquot/spec full-query-quotq-unquot
date
print -u2 "full query qu made\n"

make-score $indexfilename full-query-quotq-unquot full-parent-rank-quotq-unquot
recall-precision-table full-parent-rank-quotq-unquot rp-table-quotq-unquot
child-parent-rank-score full-query-quotq-unquot cprs-quotq-unquot
num-docs-retrieved full-query-quotq-unquot num-retrieved-quotq-unquot
date
print -u2 "qu files completed\n"

query-irts $indexfilename unquotes/ /radish/070/kim-email-dbases/${dbase_name}quot/spec full-query-unquotq-quot
date
print -u2 "full query uq made\n"

make-score $indexfilename full-query-unquotq-quot full-parent-rank-unquotq-quot
recall-precision-table full-parent-rank-unquotq-quot rp-table-unquotq-quot
child-parent-rank-score full-query-unquotq-quot cprs-unquotq-quot
num-docs-retrieved full-query-unquotq-quot num-retrieved-unquotq-quot
date
print -u2 "uq files completed\n"

query-irts $indexfilename unquotes/ /radish/070/kim-email-dbases/${dbase_name}unquot/spec full-query-unquotq-unquot
date
print -u2 "full query uu made\n"

make-score $indexfilename full-query-unquotq-unquot full-parent-rank-unquotq-unquot
recall-precision-table full-parent-rank-unquotq-unquot rp-table-unquotq-unquot
child-parent-rank-score full-query-unquotq-unquot cprs-unquotq-unquot
num-docs-retrieved full-query-unquotq-unquot num-retrieved-unquotq-unquot
date
print -u2 "uu files completed\n"

query-irts $indexfilename subjects/ /radish/070/kim-email-dbases/${dbase_name}subj/spec full-query-subj
date
print -u2 "full query s made\n"

make-score $indexfilename full-query-subj full-parent-rank-subj
recall-precision-table full-parent-rank-subj rp-table-subj
child-parent-rank-score full-query-subj cprs-subj
num-docs-retrieved full-query-subj num-retrieved-subj
date
print -u2 "s files completed\n"
rm cprs.tmp

sort -n cprs-quotq-quot > cprs.tmp
rm cprs-quotq-quot
mv cprs.tmp cprs-quotq-quot

sort -n cprs-quotq-unquot > cprs.tmp
rm cprs-quotq-unquot
mv cprs.tmp cprs-quotq-unquot

sort -n cprs-unquotq-quot > cprs.tmp
rm cprs-unquotq-quot
mv cprs.tmp cprs-unquotq-quot

sort -n cprs-unquotq-unquot > cprs.tmp
rm cprs-unquotq-unquot
mv cprs.tmp cprs-unquotq-unquot

sort -n cprs-subj > cprs.tmp
rm cprs-subj
mv cprs.tmp cprs-subj

date
print -u2 "cprs files sorted\n"

kimjoin cprs-quotq-quot cprs-quotq-unquot cprs-qq-qu
date
print -u2 "cprs files concatenated qq-qu\n"

kimjoin cprs-qq-qu cprs-unquotq-quot cprs-qq-qu-uq
date
print -u2 "cprs files concatenated qq-qu-uq\n"

kimjoin cprs-qq-qu-uq cprs-unquotq-unquot cprs-qq-qu-uq-uu
date
print -u2 "cprs files concatenated qq-qu-uq-uu\n"

kimjoin cprs-qq-qu-uq-uu cprs-subj cprs-qq-qu-uq-uu-s
date
print -u2 "cprs files concatenated qq-qu-uq-uu-s\n"

finish-cprs cprs-qq-qu-uq-uu-s cprs-final
date
print -u2 "cprs files completed\n"

# now just a little cleanup
mkdir cprs-concat
mv cprs-qq* cprs-concat
mkdir cprs-origs
mv cprs-q* cprs-u* cprs-s* cprs-origs
mkdir full-queries
mv full-query-* full-queries
mkdir num-rets
mv num-retrieved* num-rets

exit 0

#!/usr/sbin/perl
#
# usage: finish-cprs FILE OUTFILE
#
# to run on an 11-column file of ranks, such as cprs-*. will replace nr with
# 0.00 in the score column and the lowest rank from files that are hardwired.

#@files = $ARGV[0];
@files = ("num-retrieved-quotq-quot", "num-retrieved-quotq-unquot",
          "num-retrieved-unquotq-quot", "num-retrieved-unquotq-unquot",
          "num-retrieved-subj");

foreach (@files) {
    if ($_ =~ /\-quotq\-quot/) {
        $file = "1";
    }
    elsif ($_ =~ /\-quotq\-unquot/) {
        $file = "3";
    }
    elsif ($_ =~ /\-unquotq\-quot/) {
        $file = "5";
    }
    elsif ($_ =~ /\-unquotq\-unquot/) {
        $file = "7";
    }
    elsif ($_ =~ /\-subj/) {
        $file = "9";
    }

    open(MAXFILE, $_);
    while (<MAXFILE>) {
        ($child, $maxval) = split(/\s/, $_, 2);
        chop($maxval);
#       print "$child\t$maxval\n";
#       @dataarray[join(".", $file, $child)] = $maxval;
        $dataarray{join(".", $file, $child)} = $maxval;
    }
    close(MAXFILE);
}

#foreach $stuff (sort keys(%dataarray)) {
#    $ans = $dataarray{$stuff};
#    print "$stuff\t$ans\n";
#}

#$ref = (1,3,5,7,9);

open(CPRS, $ARGV[0]) || die "Couldn't open infile";
open(OUTFILE, ">$ARGV[1]") || die "Couldn't open outfile";
while (<CPRS>) {
    chop;
    s/nr/0\.00/g;
#   ($childparent,$r1,$s1,$r2,$s2,$r3,$s3,$r4,$s4,$r5,$s5) = split(/\s/, $_, 11);
    @currentline = split(/\s/, $_, 11);
#   print "$childparent,$r1,$s1,$r2,$s2,$r3,$s3,$r4,$s4,$r5,$s5\n";
    ($childid, $parentid) = split(/\./, $currentline[0], 2);
#   print "Child: $childid\tParentid: $parentid\n";

    foreach $ref (1,3,5,7,9) {
#       print "currentline[$ref]=$currentline[$ref]\n";
        if ($currentline[$ref] eq "0.00") {
            $currentline[$ref] = "$dataarray{join('.', $ref, $childid)}";
#           print "Replacing column $ref of $_ with $dataarray{join('.', $ref, $childid)}\n";
        }
    }

    print OUTFILE join("\t", @currentline), "\n";
#   print "$childparent,$r1,$s1,$r2,$s2,$r3,$s3,$r4,$s4,$r5,$s5\n";
}
close(OUTFILE);
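
The substitution that finish-cprs applies to one 11-column row can be sketched in Python as follows. It is a hedged illustration: `finish_row` is a hypothetical name, and the lookup table keyed by (rank column, child id) is an assumed in-memory equivalent of the values read from the hardwired num-retrieved-* files.

```python
RANK_COLUMNS = (1, 3, 5, 7, 9)  # rank columns; each score column follows its rank


def finish_row(fields, num_retrieved):
    """Fill in a row of [child.parent, r1, s1, ..., r5, s5].

    Every "nr" becomes "0.00"; a rank column that was "nr" is then set to
    the number of documents retrieved for that child in that run.
    """
    fields = ["0.00" if value == "nr" else value for value in fields]
    child = fields[0].split(".", 1)[0]
    for col in RANK_COLUMNS:
        if fields[col] == "0.00":
            fields[col] = num_retrieved[(col, child)]
    return fields
```

A parent that was not retrieved in a given run thus ends up with the lowest possible rank for that run and a score of 0.00.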

#!/usr/sbin/perl
#
# usage: get-score FILE
#
# this is called on by script make-score-list, to get the message numbers of
# each child and parent, and the corresponding rank of the parent
#

$state = 0;

while (<>) {
    if ($state == 0) {
        if (/Message *(\d+):/) {
            $child = $1;
            $state = 1;
        }
    }
    elsif ($state == 1) {
        if (/Message *(\d+):/) {
            $parent = $1;
            &print_results;
        }
    }
}

sub print_results {
    $score = &find_score($child, $parent);
    if ($score == null) {
        $score = "null score";
    }
    elsif ($score > 10) {
        $score = "not top 10";
    }
    print "$child\t$parent\t$score\n";
    $state = 0;
}

sub find_score {
#   open(INFILE, "/home/kknowles/project/stats/wwwtalk/drun-children") || die "Can't find file";
    open(INFILE, "/home/kknowles/project/stats/wwwtalk/query-by-quote-long-s") || die "Can't find file";
    $state = 0;

    while (<INFILE>) {
#       print "$_\n$rank\n";
        if ($state == 0) {
            if (/0*$child/) {
                $state = 1;
                $rank = 0;
            }
        }
        elsif ($state == 1) {
            if (/^\s*$parent\s/) {
                s/^\s*$parent\s*//;
                return $rank;
            }
            elsif (/^No quoted material/) {
                $rank = "No quoted material.";
                return $rank;
            }
            elsif (/^No documents to display/) {
                $rank = "No documents to display.";
                return $rank;
            }
            $rank++;
        }
        else {
            die "Shouldn't get here";
        }
    }
}

#!/usr/sbin/perl
#
# usage: kimjoin file1 file2 outfile
#
# takes two files, sorted numerically, of data of form
#   <childid>.<parentid>\t(<rank>\t<score>)x N
# and joins it the right way, with
#   <childid>.<parentid>\t(<rank>\t<score>)x N1 \t(<rank>\t<score>)x N2
# replacing the right number of fields with nr when there was no data from
# the file.

open(OUTFILE, ">$ARGV[2]");
open(FILEA, $ARGV[0]);
open(FILEB, $ARGV[1]);
$a = <FILEA>;
$b = <FILEB>;
chop($a);
chop($b);
$anr = $a;
($at, $anr) = split(/\s/, $a, 2);
$anr =~ s/[^\t]*/nr/g;
#$anr =~ s/[^\t]*/0\.00/g;
$bnr = $b;
($bt, $bnr) = split(/\s/, $b, 2);
$bnr =~ s/[^\t]*/nr/g;
#$bnr =~ s/[^\t]*/0\.00/g;
#print "$anr\t$bnr\n";

while (1) {
    if ($a == "" && $b == "") {
        exit;
    }
    ($a1, $a2) = split(/\s/, $a, 2);
    ($b1, $b2) = split(/\s/, $b, 2);
#   print "a1=$a1\tb1=$b1\n";
#   print "a2=$a2\tb2=$b2\n";
    if ($a1 == $b1) {
        if ($a1 eq $b1) {
            print OUTFILE "$a1\t$a2\t$b2\n";
            $a = <FILEA>;
            $b = <FILEB>;
            chop($a);
            chop($b);
        }
        elsif (length($a1) < length($b1)) {
            print OUTFILE "$a1\t$a2\t$bnr\n";
            $a = <FILEA>;
            chop($a);
        }
        elsif (length($a1) > length($b1)) {
            print OUTFILE "$b1\t$anr\t$b2\n";
            $b = <FILEB>;
            chop($b);
        }
    }
    elsif (((0 < $a1) && ($a1 < $b1)) || ($b1 == "")) {
        print OUTFILE "$a1\t$a2\t$bnr\n";
        $a = <FILEA>;
        chop($a);
    }
    elsif ((($a1 > $b1) && (0 < $b1)) || ($a1 == "")) {
        print OUTFILE "$b1\t$anr\t$b2\n";
        $b = <FILEB>;
        chop($b);
    }
    else {
        die "shouldn't get here";
    }
}

close(OUTFILE);
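
The join performed by kimjoin is a two-way merge of sorted files, padding with `nr` where one side has no row for a key. A minimal Python sketch follows. It is not the original algorithm verbatim: the names `key_of` and `join_rows` are hypothetical, and keys are compared as (child, parent) integer pairs, which is a substitute for the listing's numeric comparison with a length tie-break and assumes both inputs are sorted in that order.

```python
def key_of(key):
    """Turn 'child.parent' into a sortable pair of integers."""
    child, parent = key.split(".")
    return (int(child), int(parent))


def join_rows(a, b):
    """Merge two sorted lists of (key, fields) into one list of joined rows."""
    nr_a = ["nr"] * (len(a[0][1]) if a else 0)  # padding when file A lacks a key
    nr_b = ["nr"] * (len(b[0][1]) if b else 0)  # padding when file B lacks a key
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and key_of(a[i][0]) < key_of(b[j][0])):
            out.append((a[i][0], a[i][1] + nr_b))
            i += 1
        elif i == len(a) or key_of(b[j][0]) < key_of(a[i][0]):
            out.append((b[j][0], nr_a + b[j][1]))
            j += 1
        else:  # same key in both files
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
    return out
```

Applying the join repeatedly, as create_full_files does, builds up the 11-column file one run at a time.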

#!/usr/sbin/perl
#
# usage: make-score KEYFILE DRUN_FILE OUTFILE
#
# takes keyfile index of child-parent message numbers, file of output from running
# smart, and outfile. prints out childnum, parentnum, rank the parent came back at
# in outfile. if parent not returned, says 'not in top $n', where $n is the number
# of documents returned.
# taken from ~/project/bin/make-score-list and get-score, adapted for better
# I/O stuff.

$keyfile = $ARGV[0];    # file which contains desired child-parent pair by msgnum
$drunfile = $ARGV[1];   # file which is the output of running smart
$outfile = $ARGV[2];    # output file

# this to form $child $parent data array
open(INFILE, $keyfile);
while (<INFILE>) {
    chop;
    push(@pairs, $_);
}
close(INFILE);

open(OUTFILE, ">$outfile") || die "Couldn't open outfile";
open(INFILE, $drunfile) || die "Couldn't open infile";

$rank = "Rank";
#print OUTFILE "ChildID\tPParID\tRank\n";

LOOPARRAY:
foreach (@pairs) {
    print OUTFILE "$child\t$qparent\t$rank\n" unless ($rank eq "Rank");
    &new_pairs;
    if ($tmp =~ /$child\.(quot|uquot|subj)/) {
        $state = 1;
        $rank = 0;
    }
    while (<INFILE>) {
        if ($state == 0) {
            if (/$child\.(quot|uquot|subj)/) {
                $state = 1;
                $rank = 0;
            }
        }
        elsif ($state == 1) {
            if (/^\s*$parent\s/) {
                s/^\s*$parent\s+//;
                next LOOPARRAY;
            }
            elsif (/^No quoted material/) {
                $rank = "No quoted material.";
                next LOOPARRAY;
            }
            elsif (/^No documents to display/) {
                $rank = "No documents to display.";
                next LOOPARRAY;
            }
            elsif (/0\d\d\d\d\.(quot|uquot|subj)/) {
                $rank = $rank - 1;   # correction for the line itself being counted
                $rank = "Not in top $rank.";
                $tmp = $_;
                next LOOPARRAY;
            }
            $rank++;
        }
        else {
            die "Shouldn't get here";
        }
    }
}
print OUTFILE "$child\t$qparent\t$rank\n";
close(OUTFILE);

sub new_pairs {    # assigns each pair to $child, $parent
#   ($child, $parent) = split(/\t\t/, $_, 2);
    ($child, $parent) = split(/\t/, $_, 2);
    $qparent = $parent;
    $parent =~ s/^0*//;
#   print "Read new pair $child, $parent\n";
    $state = 0;
}
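
The value that make-score reports for one child-parent pair can be sketched in Python as below. This is a simplified illustration, not the original state machine: `parent_rank` is a hypothetical name, the retrieved documents are assumed to be already available as an ordered list of ids, and ranks are assumed to start at 1 for the first retrieved document.

```python
def parent_rank(retrieved, parent):
    """Return the rank of the parent among retrieved ids, or a status string."""
    if not retrieved:
        return "No documents to display."
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id == parent:
            return str(rank)
    return f"Not in top {len(retrieved)}."
```

The status strings mirror the messages used in the listing, so a missing parent is distinguishable from an empty result list.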

#!/bin/ksh
#
# make-score-list DIRECTORY > OUTFILE
#

prefix=$1   # directory of files which are indiv. datasets

#print ${prefix}
#print "Child\tParent\tRank\n"

for file in ${prefix}*
do
    get-score $file
done
exit 0

#!/usr/sbin/perl
#
# usage: num-docs-retrieved INFILE OUTFILE
#
# goes through a full-query-* file and prints to outfile the number of
# documents retrieved for each child.

open(FULL_QUERY, $ARGV[0]);
open(OUTFILE, ">$ARGV[1]");
while (<>) {
    if (/\/0*(\d+)\.(quot|unquot|subj|mbox)/ || eof) {
        print OUTFILE "$child\t$num\n" unless $child == "";
        $child = $1;
    }
    elsif (/^N/) {
        $num = 0;
    }
    else {
        $num++;
    }
}

close(FULL_QUERY);
close(OUTFILE);
#
# smartbatch1 PREFIX SPEC NUMOUT > OUT
#

# set -x

prefix=$1    # directory of files which are indiv. messages
spec=$2      # location of spec file
numout=$3    # number of records per query

for file in ${prefix}*
do
    rm /tmp/tmp.smartbatch1*
    echo $file
    cat $file | mailbody1 > /tmp/tmp.smartbatch1-data
    cat /radish/070/tmp/queryprefix /tmp/tmp.smartbatch1-data /radish/070/tmp/querysuffix > /tmp/tmp.smartbatch1-query
    cat /tmp/tmp.smartbatch1-query | smart inter $spec verbose 0 num_wanted $numout
done
exit 0
#!/usr/sbin/perl
#
# Usage: test-children PREFIX OUTFILE NUMOUT
#
# model after make-score-list and get-score, except we want it all in perl, so
# we can recognize the filenames as well.
#
# okay, first we want to grab everything in a file. i think we can pipe in
# filenames from shell command find to open(FILEHANDLE, "find <blah> |") or
# something.
# then we want open each one and scan to see one of those things.
# then we want to print out what the final status was. childid and T or F.
#
# we can use return $variable for passing values to and from subroutines.

#open(OUTFILE, ">>$ARGV[1]") || die "Couldn't open outfile";
open(OUTFILE, ">$ARGV[1]") || die "Couldn't open outfile";

print OUTFILE "Message ID\tChild?\n";

#open(FIND, "find $ARGV[0] -type f -name '*.mbox' -print | head -$ARGV[2]|") || die "Couldn't run find: $!\n";
open(FIND, "find $ARGV[0] -type f -name '*.mbox' -print |") || die "Couldn't run find: $!\n";

# the second one does it for all, the first for a specified number of messages.

while ($filename=<FIND>) {
    $msgid=&get_child_id($filename);
    $answer=&amichild($filename);
    print OUTFILE "$msgid\t\t$answer\n";
}

# we can try to extract the actual child id from the filename, i guess.
sub get_child_id {
    $filename =~ /(0\d*)\.mbox/;
    return $1;
}

• used to search :or quoted material 
Spat 2 = "( fA-Za-z ;•>*)" . 

Spat3 = " (W I ' : 

Spat 4 = " I | A-Za-s ; * \ \ | + 1 " ,* 

Spat = "Spat2 | S = iz3 |Spat4 " . 

sub amichild {
    # takes a file as an argument, looks for In-Reply-To: header, Re: in Subject
    # header, and quoted material. If all are absent, returns false, else
    # returns true.

    open(MSGFILE, $filename) || die "Can't open file";
    while (<MSGFILE>) {
        if (/^In-Reply-To:/) {
            return "T";
        }
        elsif (/^Subject: Re:/) {
            return "T";
        }

        # we don't seem to need to scan for quoted material. i think the subject line
        # containing the Re: is a really robust example. we'll leave it for now.

#       elsif (/^>/) {
#           return "T";
#       }
#       elsif (/^\s*($pat)($pat|(\s))*/) {
#           return "T";
#       }
    }
    return "F";
}
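
The child test used by test-children can be restated compactly. The following is only a Python sketch of the same heuristic, not the appendix code: the function name is_child and the pattern table are illustrative assumptions, and quoted-material detection is left out, as it is in the listing above.

```python
import re

# Hypothetical restatement of the amichild heuristic: a message counts as a
# reply ("child") if any line starts with "In-Reply-To:" or "Subject: Re:".
REPLY_PATTERNS = (
    re.compile(r"^In-Reply-To:"),
    re.compile(r"^Subject: Re:"),
)


def is_child(message_text):
    """Return "T" if the message looks like a reply, otherwise "F"."""
    for line in message_text.splitlines():
        # Like the Perl loop, this scans every line, not only the header block.
        if any(pattern.match(line) for pattern in REPLY_PATTERNS):
            return "T"
    return "F"
```

Because every line is scanned, a body line that happens to begin with "Subject: Re:" would also count; restricting the scan to the header block would be a stricter variant.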
#!/usr/sbin/perl
#
# Usage: test-irt PREFIX OUTIDS OUTNAME [NUMOUT]
#
# meant to create two files: one for outputting the childid's and ppid's of
# messages which have message-id's in the in-reply-to header, and one containing
# [illegible]
#
# okay, the way this will work is first we will open two outfiles
# print the headers for each file,
# then we will open each infile.
#
# if it contains the irt field (optionally we can save the line to a variable)
# we want to see if it contains the pattern /(<[A-Za-z0-9\.]*@[A-Za-z0-9]*>)/
# [illegible] look through all the other messages
# [illegible] the current file [illegible] then we want to get the number of
# [illegible]
# then print childid and ppid.
#
# then we need to close the child file and do the same for the next.



if ($ARGV[3] == "") {
    $ARGV[3]="3000";
}

open(OUTIDS, ">$ARGV[1]") || die "Couldn't open outfile";
print OUTIDS "Message ID\tPParent Id\n";

open(OUTNAME, ">$ARGV[2]") || die "Couldn't open outfile";
print OUTNAME "Message ID\tPParent Id\n";

#open(INFILE, "/bin/ls -1 $ARGV[0] |head -$ARGV[3]|") || die "Couldn't run ls: $!\n";
# this one (above) goes through in forwards order
#open(INFILE, "find $ARGV[0] -type f -name '*.mbox' -print | head -$ARGV[3]|") || die "Could...

opendir(INDIR, $ARGV[0]);
@infile=readdir(INDIR);
closedir(INDIR);

#while ($filename=<INFILE>) {
foreach (@infile) {
    $filename=$_;
#   print $filename;
    if ($state=&irt) {
        $msgid=&get_msg_id($filename);
        if ($state==1) {
            $answer=&find_parent_id($filename, $parentid);
            print OUTIDS "$msgid\t\t$answer\n";
        }
        elsif ($state==2) {
            $answer=&find_parent_name($filename);
            print OUTNAME "$msgid\t\t$answer\n";
        }
    }
}

# we can try to extract the actual id from the filename, i guess
sub get_msg_id {
    local($fname) = @_;
    $fname =~ /(0\d*)\.mbox/;
    return $1;
}

sub irt {
    $/="";
    $*=1;
#   print $ARGV[0] . $filename;
    open(MSGFILE, $ARGV[0] . $filename) || die "Can't open file $filename: $!";
    while (<MSGFILE>) {
        if (/^In-Reply-To: [^<]*(<[A-Za-z0-9\.-]{9,100}@[A-Za-z0-9\.-]*>)/) {
            $parentid=$1;
#           print "Parent ID= $parentid";
            $/="\n";
            $*=0;
            close(MSGFILE);
            return "1";
        }
        elsif (/^In-Reply-To: (.*)/) {
            $irt_text=$1;
            $/="\n";
            $*=0;
            close(MSGFILE);
            return "2";
        }
    }
    $/="\n";
    $*=0;
    close(MSGFILE);
    return "0";
}

sub find_parent_id {
#   open(PAR, "/bin/ls -1 $ARGV[0] | head -$ARGV[3]|") || die "Couldn't run find: $!\n";
#   while ($current=<PAR>) {
    foreach (@infile) {
        open(ENT, $ARGV[0] . $current) || die "Couldn't open parent file";
        while (<ENT>) {
            if (/^Message-Id: $parentid/) {
                close(ENT);
#               close(PAR);
                return &get_msg_id($current);
            }
        }
        close(ENT);
    }
#   close(PAR);
    return $parentid;
}

sub find_parent_name {
    # okay, so now we need to do two things: collect appropriate info from the irt
    # line in the child, and then match the parent that those things fit.
    # i think we should get a list of parents that satisfy those conditions (the
    # exact time being a stronger statement than the date) and find where the arrays
    # intersect. we may need only the time to match up.
#   &get_time_of_child;
#   &find_parents_with_time;
    return "does not contain id number";
}

$pat2="(\d{2,2}:\d{2,2}:\d{2,2})";

sub get_time_of_child {
    if ($irt_text =~ /($pat2)/) {
        $child_time=$1;
    }
}

sub find_parents_with_time {
    if ($child_time) {
#       open(PAR, "/bin/ls -1 $ARGV[0] |head -$ARGV[3]|") || die "Couldn't run find: $!\n";
#       open(PAR, "/bin/ls -1 $ARGV[0]|") || die "Couldn't run find: $!\n";
#       while ($current=<PAR>) {
        foreach (@infile) {
            $filename=$_;
            open(ENT, $ARGV[0] . $current) || die "Couldn't open parent file";
            while (<ENT>) {
                if (/^Date: .*$child_time/) {
                    close(ENT);
#                   close(PAR);
                    return &get_msg_id($current);
                }
            }
            close(ENT);
        }
#       close(PAR);
        return $child_time;
    }
    else {
        return "no time provided";
    }
}

sub get_date_of_child {
}

sub find_parents_with_date {
#   $pat1="([A-Z][a-z\.]* [A-Z]?[a-z\.]* [A-Z][a-z]*)";
#   $pat3="";
#   open(PAR, "/bin/ls -1 $ARGV[0] |head -$ARGV[3]|") || die "Couldn't run find: $!\n";
#   while ($current=<PAR>) {
#       open(ENT, $ARGV[0] . $current) || die "Couldn't open parent file";
#       @months = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov
#       while (<ENT>) {
#           foreach $i (0 .. $#months) {
#               if (/Date: $months[$i]/) {
#                   $i++;
#               }
#           }
#       }
#   close(PAR);
#   return $parentid;
}
#!/usr/sbin/perl
#
# usage: test-irt-better PREFIX OUTIDS
#
# meant to create a file for outputting the childid's and ppid's of
# messages which have message-id's in the in-reply-to header.
#

#if ($ARGV[2] == "") {
#    $ARGV[2]="3000";
#}

opendir(INDIR, $ARGV[0]);
@infile=readdir(INDIR);
closedir(INDIR);

# this section to make the array of fileid's and Message-Id's
foreach $filename (@infile) {
#   print "$filename\n";
    open(MSGFILE, $ARGV[0] . $filename) || die "Can't open file $filename: $!";
    $filename=~/(0\d*)\.mbox/;
    $parent=$1;
    while (<MSGFILE>) {
        if (/^Message-Id: (<[A-Za-z0-9\.-]{9,100}@[A-Za-z0-9\.-]+>)/) {
            $msgid=$1;
            $data="$parent\t$msgid";
#           print "$parent\t$msgid\n";
            push(@messageids, $data);
        }
    }
    close(MSGFILE);
}

#print "@infile";

# this section to make the array of fileid's and In-Reply-To ids
foreach $filename (@infile) {
#   print "$_\n";
    open(MSGFILE, $ARGV[0] . $filename) || die "Can't open file $filename: $!";
    $filename=~/(0\d*)\.mbox/;
    $child=$1;
#   print "$child\n";
    while (<MSGFILE>) {
        if (/^In-Reply-To: [^<]*(<[A-Za-z0-9\.-]{9,100}@[A-Za-z0-9\.-]*>)/) {
            $parentid=$1;
            $data="$child\t$parentid";
#           print "$child\t$parentid\n";
            push(@irtids, $data);
        }
    }
    close(MSGFILE);
}

# now we try to match the parentids of @irtids with the msgids of @messageids
foreach (@irtids) {
    ($child, $parentid) = split(/\t/, $_, 2);
    foreach (@messageids) {
        ($parent, $msgid) = split(/\t/, $_, 2);
        if ($msgid eq $parentid) {
            $pair="$child\t\t$parent";
            push(@answers, $pair);
        }
    }
}

# now print out array in outfile.
open(OUTIDS, ">$ARGV[1]") || die "Couldn't open outfile";
#print OUTIDS "Message ID\tPParent Id\n";
foreach (@answers) {
    print OUTIDS "$_\n";
}
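
The matching step in test-irt-better pairs each In-Reply-To message-id with the file whose Message-Id header carries the same id. The Python sketch below illustrates that idea under stated assumptions: the function name, the in-memory dictionary of message texts, and the use of a lookup table in place of the nested loops are all illustrative choices, not part of the appendix.

```python
import re

# Patterns modelled on the ones in the Perl listing (an approximation).
MESSAGE_ID = re.compile(
    r"^Message-Id: (<[A-Za-z0-9.\-]{9,100}@[A-Za-z0-9.\-]+>)", re.M)
IN_REPLY_TO = re.compile(
    r"^In-Reply-To: [^<\n]*(<[A-Za-z0-9.\-]{9,100}@[A-Za-z0-9.\-]*>)", re.M)


def pair_children_with_parents(messages):
    """messages maps a file id such as "00023" to the raw message text.

    Returns (child_file_id, parent_file_id) pairs for every message whose
    In-Reply-To id matches the Message-Id of another message in the set.
    """
    id_to_file = {}
    for file_id, text in messages.items():
        match = MESSAGE_ID.search(text)
        if match:
            id_to_file[match.group(1)] = file_id

    pairs = []
    for file_id, text in messages.items():
        match = IN_REPLY_TO.search(text)
        if match and match.group(1) in id_to_file:
            pairs.append((file_id, id_to_file[match.group(1)]))
    return pairs
```

A dictionary lookup replaces the quadratic scan of the Perl version; if two files carried the same Message-Id, this sketch would keep only the last one, whereas the nested loops would report both.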
David D. Lewis and Kimberly Knowles, 29-May-96

forall: 8/17/95 does something for all files in current directory. either does
the command for each file, or replaces each file for 'R' in command. (use R
instead of *)

# usage: forall "COMMAND PARAM"
#        forall "COMMAND REPL REPL"

geomean: 8/14/95 calculates the geometric mean for set of data in array, of form
cprs-final.

# usage: geomean FILE > OUTFILE
# to run on an 11-column file of ranks, such as cprs-final. will calculate
# geometric mean for each child and parent pair.

rand-selector: 7/10/95 what you think it does. takes args maximum and numwanted
and gives back a list of random numbers. run a one-liner from the camel book
to cut out the repetitions.

stdevchildparent: 7/17/95 finds ave and standard dev of a hardwired list of
numbers, and prints it.
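
For rand-selector, drawing without repetition can be done in one step instead of filtering duplicates afterwards. This is a Python sketch only; the function name and the optional seed parameter are assumptions, and it draws from 1 through MAXIMUM inclusive, which may differ from the range the Perl script covers.

```python
import random


def select_random(num_wanted, maximum, seed=None):
    """Return num_wanted distinct integers drawn from 1..maximum."""
    rng = random.Random(seed)  # seed is optional and only aids repeatability
    # sample() never repeats a value, so no de-duplication pass is needed.
    return rng.sample(range(1, maximum + 1), num_wanted)
```

Asking for more numbers than the range holds raises ValueError, which makes an impossible request visible rather than silently returning a short list.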
#!/usr/sbin/perl
#
# usage: forall "COMMAND PARAM"
#        forall "COMMAND REPL REPL"

opendir(INDIR, ".");
@allfiles=readdir(INDIR);
closedir(INDIR);

# cheap hack to eliminate ./ and ../ entries
shift(@allfiles);
shift(@allfiles);

foreach (@allfiles) {
    if ($ARGV[0] =~ /REPL/) {
        $ARGV[0] =~ s/REPL/$_/g;
        print `$ARGV[0]`;
        $ARGV[0] =~ s/$_/REPL/g;
    }
    else {
        print `$ARGV[0] $_`;
    }
}
#!/usr/sbin/perl
#
# usage: geomean FILE > OUTFILE
#
# to run on an 11-column file of ranks, such as cprs-final. will calculate
# geometric mean for each child and parent pair.

open(CPRS, $ARGV[0]);
while (<CPRS>) {
    $prod=1;
    chop;
    s/nr/0\.00/g;
    @currentline=split(/\s/,$_,11);
    ($childid, $parentid)=split(/\./,$currentline[0],2);
    foreach $ref (1,3,5,7,9) {
        if ($currentline[$ref] =~ /n/) {
            $tmp=$currentline[$ref];
            $tmp=~s/n//;
            $tmp=($tmp+1)/2;
        }
        else {
            $tmp=$currentline[$ref];
        }
        $prod=$prod*$tmp;
    }
    $geomean=$prod**.2;
    print "$currentline[0]\t$geomean\n";
}
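
The arithmetic in geomean is a product of five rank values followed by a fifth root. The Python sketch below shows the same calculation for a list of any length; the function names are illustrative assumptions. The handling of fields containing "n" simply mirrors what the script does (strip the letter, add one, halve); the meaning of that marker is not stated here.

```python
def rank_value(field):
    """Convert one rank field to a number, mirroring the script's 'n' rule."""
    if "n" in field:
        # Same steps as the Perl: drop the first 'n', then (value + 1) / 2.
        return (float(field.replace("n", "", 1)) + 1) / 2
    return float(field)


def geometric_mean(values):
    """n-th root of the product of n values."""
    if not values:
        raise ValueError("need at least one value")
    product = 1.0
    for value in values:
        product *= value
    return product ** (1.0 / len(values))
```

With five values the exponent is 0.2, which matches the constant used in the script.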
#!/usr/sbin/perl
#
# usage: rand-selector NUMWANTED MAXIMUM > OUTFILE
#
# example: rand-selector 5 10 > tmp
#

srand(time|$$);
$i=$ARGV[0];

while ($i>0) {
    $a=int(rand($ARGV[1]-1))+1;
    print "$a\n";
#!/usr/sbin/perl
#
# usage: stdevchildparent
#
# this should compute the standard dev of a list given it
#

•*daca-<2. 4.2. 7. 7. 2.2. 2. 10. 2. 4. 2. 2. J. 2. 3. 2. 2. 2. 6. 2. 2. 2. 4, 2. 3). 
*«d*t*-f2.4. 2^^. 2.2. 2. 10. 2. 4. 2. 2. 3. 2. 3. 2. 2. 2. 6. 2. 2. 2. 4. 2, 3. 921; 

t«d*e«-(2. 2. 2". 4. 4. 2. 2. 2. 3. 3. 2. $.3. 10. 3.2. 7. 2. 2. 2. 2. 3. 2. 5. 2. 2. 5. 2. 2. 2. 3. J. 2 2 3 3 

2.2,2.3 . 4.4.2 10. 2 . 2. 4. S. 6. 2 . 2, 6. 7. 2. 3. 3. 3. 2.2.2 2l 
<*<lAt«« (2.2.2. 4. 4.2. 1.2. 3- 1 72 . 3 . 26 . 2 . 8 . 3 . 10 . 3 . 437 . 2 . 7 . 2 . 2 . 12 , 116 . 2 . 2 . 3 . 2 . S . 2 . 2 . 13 
5. 2. 2. 2. 3. 3. 2. 2. 3. 3. 2. 2. 2. 3. 4. 26. 4. 2. 10. 2. 2. 4. 5. 6. 2. I 

* 8.2.69.6.7.2.3.5.3.2.22.2.2.2.2883); 
•«daca- ( 1 . 2 . 3 ) ; 

$len=@data;
#print $len;
$i=$len-1;
$sum=0;

while ($i>-1) {
    $sum=$sum+$data[$i];
    $i--;
}

$ave=$sum/$len;
$i++;
$var=0;

while ($i<$len) {
    $d=$data[$i]-$ave;
#   print "$d\n";
    $var=$var+($d*$d);
    $i++;
}

$var=$var/($len-1);
$stdev=sqrt($var);

print "Average = $ave\n";
print "Standard Deviation = $stdev\n";
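
The calculation in stdevchildparent is the ordinary mean plus the sample standard deviation (dividing by n - 1). A minimal Python sketch of the same arithmetic follows; the function name is an assumption, and the data are passed in rather than hardwired.

```python
import math


def mean_and_sample_stdev(data):
    """Return (average, sample standard deviation) of a list of numbers."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values for a sample deviation")
    ave = sum(data) / n
    # Sum of squared deviations, divided by n - 1 as in the script.
    var = sum((x - ave) ** 2 for x in data) / (n - 1)
    return ave, math.sqrt(var)
```

For the three-element test list (1, 2, 3) that appears among the commented-out data lines, this gives an average of 2 and a standard deviation of 1.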
(28-May-96: Additional notes on how SMART was used to match portions
of child messages with potential parents. These notes supplement
the description in

@article{LEWIS96e,
  author = "David D. Lewis and Kimberly A. Knowles",
  title = "Threading Electronic Mail: A Preliminary Study",
  journal = "Submitted for publication. Comments welcome",
  year = 1996,
  copyrightown = "AT&T 1996"
}
)



7/6/95: This is an early study of the messages that were picked out by smart
on a randomly chosen (I picked a number out of my head) article. This used
the connectionists database, in /radish/070/netlists/archives/connectionists/

observation: the numbers that smart assigns seem to be (at least in the early
articles) just an index of the articles in the order it took them. this is
true at least within the first archive file. i haven't broken up the messages
into individual files yet.

I selected #23 to use as a query. (remember that the query commands are case-
sensitive.) The subject line is "Kolmogorov's superposition theorem".

Here are the smart results.

Smart (ntq?) . Drun 2j 



Num            Action   Sim
23                      1.[illegible]
43                      0.28
26                      0.27
246                     0.24
59                      0.23
303                     0.23
[illegible]             0.23
32[illegible]           0.23
124                     0.22
221                     0.22

Title (in printed order)
Kolmogorov's superposition theorem
Kolmogorov's superposition theorem
papers on pattern recognition
NIPS CALL FOR PAPERS
Bi-monthly Reminder file in Inbox. Composite networks TR • Conn
CALL FOR PARTICIPATION
Bi-monthly Reminder file in Inbox. Delays in Neuroprose Putting
What is a connectionist net? Here's what it's not.
NN conference. ROOMMATES FOR NIPS 89? CFP. Parallel Comput



I will include the full text of the top three messages below these
observations.

#23 header info:
Date: Tue, 17 Jan 89 14:08:03 EST
From: sontag@fermat.rutgers.edu
Message-Id: <8901171908.AA00964@control.rutgers.edu>
To: Connectionists@cs.cmu.edu
Cc: rui@zeta.rice.edu
Subject: Kolmogorov's superposition theorem
Reply-To: sontag@fermat.rutgers.edu

#43 header info:
Date: Wed, 25 Jan 89 17:34:38 CST
From: Rui DeFigueiredo <rui@rice.edu>
Message-Id: <8901252334.AA01804@zeta.rice.edu>
To: Connectionists@cs.cmu.edu
Cc: bradley@fermat.rutgers.edu, poggio@wheaties.ai.mit.edu,
        sontag@fermat.rutgers.edu
In-Reply-To: poggio@wheaties.ai.mit.edu's message of Tue, 17 Jan 89 22:47:17 EST
Subject: Kolmogorov's superposition theorem

#26 header info:
Date: Tue, 17 Jan 89 22:47:17 EST
From: poggio@wheaties.ai.mit.edu (Tomaso Poggio)
Message-Id: <8901180347.AA21068@rice-chex.ai.mit.edu>
To: sontag@fermat.rutgers.edu
Cc: Connectionists@cs.cmu.edu, rui@zeta.rice.edu
In-Reply-To: sontag@fermat.rutgers.edu's message of Tue
Subject: Kolmogorov's superposition theorem

Preliminary observations show that smart did very well with these particular
messages. I, however, have not searched to see if there are other messages it
should have gotten, though it seems implausible, as I have egreped the entire
archive for "Kolmogorov" (inserted below). The tree of parent/child
relationships is as follows:
23>26>43.

Thus, smart got that they are interrelated, but ranked the
grandchild above the child. However, the appropriate information can be gotten
by using just the message id. The Subject: header is reliable only in content;
no Re: flags were inserted at all. The dates alone could have shown me the
correct order.
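
The remark that the dates alone give the correct order can be checked mechanically. The Python sketch below sorts the three messages by their Date: headers; the function name and the dictionary layout are assumptions made for the illustration, and the header values are the ones listed for messages 23, 26 and 43.

```python
from email.utils import parsedate_to_datetime


def order_by_date(date_headers):
    """date_headers maps a message number to its Date: header value.

    Returns the message numbers sorted from earliest to latest.
    """
    return sorted(date_headers,
                  key=lambda num: parsedate_to_datetime(date_headers[num]))


dates = {
    43: "Wed, 25 Jan 89 17:34:38 CST",
    23: "Tue, 17 Jan 89 14:08:03 EST",
    26: "Tue, 17 Jan 89 22:47:17 EST",
}
print(order_by_date(dates))
```

Sorting by date recovers the chain only when each reply follows its parent in time and the thread is linear; it orders messages but does not by itself say which message a reply answers.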

observation: in investigating whether or not smart was able to build the entire
archive before radish crashed, I came upon the following problem. There is
indeed a message that contains "From my point of view as a connectionist."
which is mistakenly parsed as a new message. Furthermore, in that same
archive, arch.9106, the messages following that false new message are missing
the pinhead headers. Thus, smart thinks it's one big message for who knows
how many more messages. I doubt that the actual case is that this one message
actually encapsulated so many forwards. whether it's an error in the archive
or the smart parser, I don't know, but clearly perhaps the pinhead header is
not as reliable as we once thought.

okay, i've found it out. smart did get to finish, as the last message in the
archive is in the smart database. however, it is the nth message that got
concatenated together (without the pinhead to separate them) in message 382



titles 



Num   Action   Sim
382            0.24
329            0.19
278            0.16
105            0.16
162            0.16
305            0.15
351            0.14
324            0.14
174            0.14
364            0.14


i 3 i-is c 


used the 



ph.d. thesis available: NN for Signal Processing NNSP95 - Forma 
TD-Gammon paper available in neuroprose Brisbane Neural Network 
Informal Computing Workshop Announcement: Conference for Phiios 



Why does the error rise in a SRN? Special Issue on Neural Model 
Copernicus project with Central and Eastern European Countries 
policy on posting talk announcements European Society for Philo
Share hotel room at IJCNN? 



history. 



Dtext 23

From Connectionists-Request@Q.CS.CMU.EDU Tue Jan 17 14:43:00 1989
Received: from Q.CS.CMU.EDU by B.GP.CS.CMU.EDU; 17 Jan 89 14:42:24 EST
Received: from CS.CMU.EDU by Q.CS.CMU.EDU; 17 Jan 89 14:38:24 EST
Received: from FERMAT.RUTGERS.EDU by CS.CMU.EDU; 17 Jan 89 14:37:28 EST
Received: by control.rutgers.edu (5.59/(RU-Router/1.1)/3.01)
        id AA00964; Tue, 17 Jan 89 14:08:03 EST
Date: Tue, 17 Jan 89 14:08:03 EST
From: sontag@fermat.rutgers.edu
Message-Id: <8901171908.AA00964@control.rutgers.edu>
To: Connectionists@cs.cmu.edu
Cc: rui@zeta.rice.edu
Subject: Kolmogorov's superposition theorem
Reply-To: sontag@fermat.rutgers.edu

I am posting this for Professor Rui de Figuereido, a researcher in Control
Theory and Circuits who does not subscribe to this list. Please direct
cc's of all responses to his e-mail address (see below).

-eduardo s.

KOLMOGOROV'S SUPERPOSITION THEOREM AND ARTIFICIAL NEURAL NETWORKS

Rui J. P. de Figueiredo
Dept. of Electrical and Computer Engineering
Rice University, Houston, TX 77251-1892
e-mail: rui@zeta.rice.edu

The implementation of the Kolmogorov-Arnold-Sprecher Superposition Theorem
[1-3] in terms of artificial neural networks was first presented and fully
discussed by me in 1980 [4]. I also discussed, then [4], applications of
these structures to statistical pattern recognition and image and multi-
dimensional signal processing. However, I did not use the words "neural
networks" in defining the underlying networks. For this reason, the current
researchers on neural nets including Robert Hecht-Nielsen [5] do not seem to
be aware of my contribution [4]. I hope that this note will help correct
history.

Incidentally, there is a misprint in [4]. In [4], please insert "no" in
the statement before eqn.(4). That statement should read: "Sprecher showed
that lambda can be any nonzero number which satisfies no equation ..."

[1] A.K.Kolmogorov, "On the representation of continuous functions of several
    variables by superposition of continuous functions of one variable and
    addition," Dokl.Akad.Nauk.SSSR, Vol.114, pp.369-373, 1957.
[2] V.I.Arnol'd, "On functions of three variables," Dokl.Akad.Nauk.SSSR,
    Vol.114, pp.533-956, 1957.
[3] D.A.Sprecher, "An improvement in the superposition theorem of Kolmogorov,"
    J.Math.Anal.Appl., Vol.38, pp.208-213, 1972.
[4] Rui J.P. de Figueiredo, "Implications and applications of Kolmogorov's
    superposition theorem," IEEE Trans.Auto.Contr., Vol.AC-25, pp.1227-1231, 1980.
[5] R.Hecht-Nielsen, "Kolmogorov's mapping neural network existence theorem,"
    IEEE 1st Int.Conf.on Neural Networks, San Diego, CA, June 21-24, 1987, paper
    III-11.
Dtext 43

From Connectionists-Request@Q.CS.CMU.EDU Wed Jan 25 18:53:58 1989
Received: from Q.CS.CMU.EDU by B.GP.CS.CMU.EDU; 25 Jan 89 18:52:40 EST
Received: from CS.CMU.EDU by Q.CS.CMU.EDU; 25 Jan 89 18:47:04 EST
Received: from RICE.EDU by CS.CMU.EDU; 25 Jan 89 18:45:19 EST
Received: from zeta.rice.edu by rice.edu (AA29505); Wed, 25 Jan 89 17:44:04 CST
Received: by zeta.rice.edu (AA01804); Wed, 25 Jan 89 17:34:38 CST
Date: Wed, 25 Jan 89 17:34:38 CST
From: Rui DeFigueiredo <rui@rice.edu>
Message-Id: <8901252334.AA01804@zeta.rice.edu>
To: Connectionists@cs.cmu.edu
Cc: bradley@fermat.rutgers.edu, poggio@wheaties.ai.mit.edu,
        sontag@fermat.rutgers.edu
In-Reply-To: poggio@wheaties.ai.mit.edu's message of Tue, 17 Jan 89 22:47:17 EST
Subject: Kolmogorov's superposition theorem

   Kolmogorov's theorem and its relation to networks
   are discussed in Biol. Cyber., 37, 167-186, 1979.
   (On the representation of multi-input systems:
   computational properties of polynomial algorithms,
   Poggio and Reichardt). There are references there to
   older papers (see especially the two nice papers
   by H. Abelson).

   end of message

Poggio and Reichardt's paper, "On the
representation of multi-input systems: Computational
properties of polynomial algorithms" (Biol. Cyber., 37,
167-186, 1980) appeared, not earlier but, in the same
year as deFigueiredo's, "Implications and applications of
Kolmogorov's superposition theorem" (IEEE Trans. on
Automatic Control, AC-25, 1227-1231, 1980).



Dtext 26

From Connectionists-Request@Q.CS.CMU.EDU Wed Jan 18 13:38:05 1989
Received: from Q.CS.CMU.EDU by B.GP.CS.CMU.EDU; 18 Jan 89 13:37:?5 EST
Received: from CS.CMU.EDU by Q.CS.CMU.EDU; 17 Jan 89 22:52:50 EST
Received: from RICE-CHEX.AI.MIT.EDU by CS.CMU.EDU; 17 Jan 89 22:5?:29 EST
Received: by rice-chex.ai.mit.edu; Tue, 17 Jan 89 22:47:17 EST
Date: Tue, 17 Jan 89 22:47:17 EST
From: poggio@wheaties.ai.mit.edu (Tomaso Poggio)
Message-Id: <8901180347.AA21088@rice-chex.ai.mit.edu>
To: sontag@fermat.rutgers.edu
Cc: Connectionists@cs.cmu.edu, rui@zeta.rice.edu
In-Reply-To: sontag@fermat.rutgers.edu's message of Tue, 17 Jan 89 14:08:03 EST <8901171908.AA00964@control.rutgers.edu
Subject: Kolmogorov's superposition theorem

Kolmogorov's theorem and its relation to networks are discussed in
Biol. Cyber., 37, 167-186, 1979. (On the representation of multi-input
systems: computational properties of polynomial algorithms, Poggio and
Reichardt). There are references there to older papers (see especially
the two nice papers by H. Abelson).



<radish: 12> egrep "Kolmogorov" * >> ~/project/early-stats
arch.8901:Subject: Kolmogorov's superposition theorem
arch.8901:The implementation of the Kolmogorov-Arnold-Sprecher Superposition Theorem
arch.8901:[1] A.K.Kolmogorov, "On the representation of continuous functions of several
arch.8901:[3] D.A.Sprecher, "An improvement in the superposition theorem of Kolmogorov,"
arch.8901:[4] Rui J.P. de Figueiredo, "Implications and applications of Kolmogorov's
arch.8901:[5] R.Hecht-Nielsen, "Kolmogorov's mapping neural network existence theorem,"
arch.8901:Subject: Kolmogorov's superposition theorem
arch.8901:Kolmogorov's theorem and its relation to networks are discussed in
arch.8901:Subject: Kolmogorov's superposition theorem
arch.8901:   Kolmogorov's theorem and its relation to networks
arch.8901:Kolmogorov's superposition theorem" (IEEE Trans. on
arch.8903:paper on "Kolmogorov's Mapping Neural Network Theorem" (1987 INNS proceedings):
arch.8906:The mistake in this line of reasoning is as follows: Using the Kolmogorov
arch.8906:Naturally, ideas based on Kolmogorov complexity are easiest to apply if we
arch.8906:>> The mistake in this line of reasoning is as follows: Using the Kolmogorov
arch.8906:Moreover, any discussion of Kolmogorov complexity vs. modeling errors
arch.8906:about the Kolmogorov-Chaitin view of complexity is that (in the limit) it
arch.8906:Subject: Kolmogorov-Chaitin complexity (was: Categorization and Supervision)
arch.8906:about the Kolmogorov-Chaitin view of complexity is that (in the limit) it
arch.8910:using Kolmogorov-Chaitin algorithmic complexity applied to dynamical
arch.9001:Representation properties of networks: Kolmogorov's theorem is irrelevant
arch.9001:instructions (Kolmogorov). I know that only a truly random string is
arch.9001:The ideas of minimum description lenght (mdl), Kolmogorov-Chaitin complexity,
arch.900R:Kolmogorov complexity of an architectural description be useful as a
arch.9105:"Chaitin-Kolmogorov Complexity and Generalization in Neural
arch.9112:Kolmogorov's Theorem is Relevant
arch.9207:generalization (Vapnik-Chervonenkis dimension, Kolmogorov methods,
arch.9210:* Kolmogorov Complexity; Universal Prior Distribution; generating "hard"
arch.9210:* A scheme for generating compressible binary vectors motivated by Kolmogorov
arch.9301:On the Realization of a Kolmogorov Network
arch.9305:The Kolmogorov Signal processor, Prof. M.A. Lagunas, A. Prez, M. Najar, A. Pags, UPC
arch.9404:Solomonoff, Kolmogorov, Chaitin, Wallace, Rissanen, and others. The
arch.9404:Kolmogorov complexity, algorithmic information theory, generalized
arch.9404:Kolmogorov complexity, minimum message-length inference, the minimum
arch.9408:Subject: Kolmogorov complexity and generalization
arch.9408:of those `based on Kolmogorov complexity and Solomonoff's
arch.9408:Levin complexity (a time-bounded generalization of Kolmogorov
arch.9408:"simple" neural networks with low Kolmogorov complexity and high
arch.9501:{\em Learnability of Kolmogorov-easy circuit expressions via queries}.
arch.9501:{\em Kolmogorov numberings and minimal identification}.
arch.9505:Learnability of Kolmogorov-Easy Circuit Expressions Via Queries
arch.9505:the case in which the circuit expressions are of low (time-bounded) Kolmogorov
arch.9505:of the results to various Kolmogorov complexity bounds is discussed






WO 97/46962 



PCT/US97/09161 



008r-preparing-data-sets-digest-fo Fri May 2 09:30:35 1997 1

(28-May-96: Additional notes on breaking up text into appropriate
segments. These notes supplement the description of the formatting
code.)



KAK: 22 August 1995

created datasets using filmus-l. forgot to document it. am now
following up in the same fashion with wisa.

with archives filmus-l, itrse-l, and wisa, they arrive in chunks of
digest format through email. then an imail hook runs and concatenates
them all to one file. we need to run header-parser to get rid of the
mbox headers. note that this assumes the archives are in NOT-mbox
format, because the script looks for the pinhead to strip out.



<radish: 1> pwd
/home/kknowles/project/explorations
<radish: 2> cd ./archives-from-listserv/wisa/
<radish: 3> ls
total 160
-rw-r-- 1 bl0112^: 73962 Jul 5 14:46 wisa-archives
<radish: 4> header-parser wisa-archives just-archives
<radish: 5> ls
total [illegible]
-rw-r-- 1 b!0112c;- 72192 Aug 23 15:34 just-archives
-rw-r-- 1 b!0112c; 73982 Jul 5 14:46 wisa-archives



okay, now that's done. now i'd like to create the new test data
directory, and parse this archive into separate numbered messages.

<radish: 6> mkdir /radish/070/projwisa
<radish: ?> mkdir /radish/070/projwisa/messages

so we have parse-digest which uses the line "^Date:" as a message
delimiter, and we have trim-digest, which removes the message separator
























<radish: 9> parse-digest just-archives /radish/070/projwi
<radish: 10> ls ...0/projwisa/messages
total 165
-rw-r--r-- 1 kknowles 1193 Aug 23 15:40 00001.mbox
-rw-r--r-- 1 kknowles 1335 Aug 23 15:40 00002.mbox
-rw-r--r-- 1 kknowles  832 Aug 23 15:40 00003.mbox
-rw-r--r-- 1 kknowles 1368 Aug 23 15:40 00004.mbox
-rw-r--r-- 1 kknowles  861 Aug 23 15:40 00005.mbox
-rw-r--r-- 1 kknowles 2216 Aug 23 15:40 00006.mbox
-rw-r--r-- 1 kknowles 1563 Aug 23 15:40 00007.mbox
-rw-r--r-- 1 kknowles 4039 Aug 23 15:40 00008.mbox
-rw-r--r-- 1 kknowles 185? Aug 23 15:40 00009.mbox
-rw-r--r-- 1 kknowles 1057 Aug 23 15:40 00010.mbox
-rw-r--r-- 1 kknowles 1163 Aug 23 15:40 00011.mbox
-rw-r--r-- 1 kknowles 1397 Aug 23 15:40 00012.mbox
-rw-r--r-- 1 kknowles 1799 Aug 23 15:40 00013.mbox
-rw-r--r-- 1 kknowles  790 Aug 23 15:40 00014.mbox
-rw-r--r-- 1 kknowles  874 Aug 23 15:40 00015.mbox
-rw-r--r-- 1 kknowles  674 Aug 23 15:40 00016.mbox
-rw-r--r-- 1 kknowles 1195 Aug 23 15:40 00017.mbox
-rw-r--r-- 1 kknowles 2528 Aug 23 15:40 00018.mbox
-rw-r--r-- 1 kknowles 2372 Aug 23 15:40 00019.mbox
-rw-r--r-- 1 kknowles 2627 Aug 23 15:40 00020.mbox
-rw-r--r-- 1 kknowles 1937 Aug 23 15:40 00021.mbox
-rw-r--r-- 1 kknowles 1037 Aug 23 15:40 00022.mbox
-rw-r--r-- 1 kknowles  633 Aug 23 15:40 00023.mbox
-rw-r--r-- 1 kknowles  901 Aug 23 15:40 00024.mbox
-rw-r--r-- 1 kknowles  991 Aug 23 15:40 00025.mbox
-rw-r--r-- 1 kknowles 1854 Aug 23 15:40 00026.mbox
-rw-r--r-- 1 kknowles 7846 Aug 23 15:40 00027.mbox
-rw-r--r-- 1 kknowles 1008 Aug 23 15:40 00028.mbox
-rw-r--r-- 1 kknowles 1388 Aug 23 15:40 00029.mbox
-rw-r--r-- 1 kknowles 1152 Aug 23 15:40 00030.mbox
-rw-r--r-- 1 kknowles 2198 Aug 23 15:40 00031.mbox
-rw-r--r-- 1 kknowles  364 Aug 23 15:40 00032.mbox
-rw-r--r-- 1 kknowles  997 Aug 23 15:40 00033.mbox
-rw-r--r-- 1 kknowles 1128 Aug 23 15:40 00034.mbox
-rw-r--r-- 1 kknowles 1031 Aug 23 15:40 00035.mbox
-rw-r--r-- 1 kknowles  908 Aug 23 15:40 00036.mbox
-rw-r--r-- 1 kknowles  ?95 Aug 23 15:40 00037.mbox
-rw-r--r-- 1 kknowles  511 Aug 23 15:40 00038.mbox
-rw-r--r-- 1 kknowles 1239 Aug 23 15:40 00039.mbox
-rw-r--r-- 1 kknowles  843 Aug 23 15:40 00040.mbox
-rw-r--r-- 1 kknowles  755 Aug 23 15:40 00041.mbox
-rw-r--r-- 1 kknowles  572 Aug 23 15:40 00042.mbox
-rw-r--r-- 1 kknowles  425 Aug 23 15:40 00043.mbox


-rw- r - 


-r-- 


I 


kknowiei 


2097 


Aug 


2 3 


15 


: 40 


00044 


. mbox 


- rw- r - 


- r - - 




kknowie! 


1334 


Aug 


-> 3 


15 


:40 


00045 


. mbox 


- rw- r- 


- r - - 


I 


kknowi e = 


557 


Aug 




15 


: 40 


00046 


. mbox 


-rw-r- 


- r - - 




kknowi c i 


2594 


Aug 


23 


15 


: 40 


00047 


. mbox 


- rw- r - 


- r - - 


1 


k k now i e r 


1409 


Aug 


2 J 


lb 


: 40 


00048 


. mbox 


-rw-r- 






kknowi; i 


1474 


Aug 


2 3 


15 


: 40 


00049 


. mbox 



<radish: 12> ls /radish/070/projwisa/messages
total 156
-rw-r--r--  1 kknowles     1119 Aug 23 15:42 00001.mbox
-rw-r--r--  1 kknowles     1261 Aug 23 15:42 00002.mbox
-rw-r--r--  1 kknowles      758 Aug 23 15:42 00003.mbox
-rw-r--r--  1 kknowles     1294 Aug 23 15:42 00004.mbox
-rw-r--r--  1 kknowles      787 Aug 23 15:42 00005.mbox
-rw-r--r--  1 kknowles     2142 Aug 23 15:42 00006.mbox
-rw-r--r--  1 kknowles     1489 Aug 23 15:42 00007.mbox
-rw-r--r--  1 kknowles     3965 Aug 23 15:42 00008.mbox
-rw-r--r--  1 kknowles     1783 Aug 23 15:42 00009.mbox
-rw-r--r--  1 kknowles      983 Aug 23 15:42 00010.mbox
-rw-r--r--  1 kknowles     1089 Aug 23 15:42 00011.mbox
-rw-r--r--  1 kknowles     1323 Aug 23 15:42 00012.mbox
-rw-r--r--  1 kknowles     1725 Aug 23 15:42 00013.mbox
-rw-r--r--  1 kknowles      716 Aug 23 15:42 00014.mbox
-rw-r--r--  1 kknowles      800 Aug 23 15:42 00015.mbox
-rw-r--r--  1 kknowles      600 Aug 23 15:42 00016.mbox
-rw-r--r--  1 kknowles     1121 Aug 23 15:42 00017.mbox
-rw-r--r--  1 kknowles     2454 Aug 23 15:42 00018.mbox
-rw-r--r--  1 kknowles     2298 Aug 23 15:42 00019.mbox
-rw-r--r--  1 kknowles     2553 Aug 23 15:42 00020.mbox
-rw-r--r--  1 kknowles     1863 Aug 23 15:42 00021.mbox
-rw-r--r--  1 kknowles      963 Aug 23 15:42 00022.mbox
-rw-r--r--  1 kknowles      559 Aug 23 15:42 00023.mbox
-rw-r--r--  1 kknowles      827 Aug 23 15:42 00024.mbox
-rw-r--r--  1 kknowles      917 Aug 23 15:42 00025.mbox
-rw-r--r--  1 kknowles     1780 Aug 23 15:42 00026.mbox
-rw-r--r--  1 kknowles     7772 Aug 23 15:42 00027.mbox
-rw-r--r--  1 kknowles      934 Aug 23 15:42 00028.mbox
-rw-r--r--  1 kknowles     1314 Aug 23 15:42 00029.mbox
-rw-r--r--  1 kknowles     1078 Aug 23 15:42 00030.mbox
-rw-r--r--  1 kknowles     2124 Aug 23 15:42 00031.mbox
-rw-r--r--  1 kknowles      790 Aug 23 15:42 00032.mbox
-rw-r--r--  1 kknowles      923 Aug 23 15:42 00033.mbox
-rw-r--r--  1 kknowles     1054 Aug 23 15:42 00034.mbox
-rw-r--r--  1 kknowles      957 Aug 23 15:42 00035.mbox
-rw-r--r--  1 kknowles      834 Aug 23 15:42 00036.mbox
-rw-r--r--  1 kknowles      621 Aug 23 15:42 00037.mbox
-rw-r--r--  1 kknowles      437 Aug 23 15:42 00038.mbox
-rw-r--r--  1 kknowles     1165 Aug 23 15:42 00039.mbox
-rw-r--r--  1 kknowles      769 Aug 23 15:42 00040.mbox
-rw-r--r--  1 kknowles      661 Aug 23 15:42 00041.mbox
-rw-r--r--  1 kknowles      498 Aug 23 15:42 00042.mbox
-rw-r--r--  1 kknowles      351 Aug 23 15:42 00043.mbox
-rw-r--r--  1 kknowles     2023 Aug 23 15:42 00044.mbox
-rw-r--r--  1 kknowles     1334 Aug 23 15:42 00045.mbox
-rw-r--r--  1 kknowles      483 Aug 23 15:42 00046.mbox
-rw-r--r--  1 kknowles     2520 Aug 23 15:42 00047.mbox
-rw-r--r--  1 kknowles     1335 Aug 23 15:42 00048.mbox
-rw-r--r--  1 kknowles     1474 Aug 23 15:42 00049.mbox



okay, so we see we have 49 messages.

this done, we copy over the create_full_files file from somewhere,
like /radish/070/projfilml/.

<radish: 14> cd /radish/070/projwisa
<radish: 15> cp ../projfilml/create_full_files ./
<radish: 16> ls
total 11
-rwxr-xr-x  1 kknowles     4303 Aug 23 15:48 create_full_files*
drwxr-xr-x  2 kknowles     1024 Aug 23 15:42 messages/

and we alter create_full_files for this set. the only place that needs
to become personalised is the location of the smart databases. of
course, this all becomes irrelevant when we realize we have little place
to get the answers or the 'truth' from. but we can try.

<radish: 17> create_full_files 2> cff-log
Wed Aug 23 15:52:21 EDT 1995
Wed Aug 23 15:52:22 EDT 1995
/radish/070/projwisa
Indexing docs at Wed Aug 23 15:52:23 EDT 1995
0.3u 2.6s 0:04 61% 0+0k 71+12io 53pf+0w
All done at Wed Aug 23 15:52:27 EDT 1995
Wed Aug 23 15:52:27 EDT 1995
/radish/070/projwisa
Indexing docs at Wed Aug 23 15:52:28 EDT 1995
0.2u 0.6s 0:01 64% 0+0k 4+12io 1pf+0w
All done at Wed Aug 23 15:52:28 EDT 1995
Wed Aug 23 15:52:29 EDT 1995
/radish/070/projwisa
Indexing docs at Wed Aug 23 15:52:29 EDT 1995
0.3u 2.5s 0:03 85% 0+0k 5+12io 0pf+0w
All done at Wed Aug 23 15:52:32 EDT 1995
Wed Aug 23 15:52:32 EDT 1995
Wed Aug 23 15:52:32 EDT 1995
Wed Aug 23 15:52:33 EDT 1995
Wed Aug 23 15:52:33 EDT 1995
Wed Aug 23 15:52:33 EDT 1995
Wed Aug 23 15:52:33 EDT 1995
Wed Aug 23 15:52:34 EDT 1995
Wed Aug 23 15:52:34 EDT 1995
Wed Aug 23 15:52:34 EDT 1995
Wed Aug 23 15:52:34 EDT 1995
Wed Aug 23 15:52:35 EDT 1995
Wed Aug 23 15:52:36 EDT 1995
Wed Aug 23 15:52:36 EDT 1995
Wed Aug 23 15:52:36 EDT 1995
Wed Aug 23 15:52:36 EDT 1995
Wed Aug 23 15:52:36 EDT 1995
Wed Aug 23 15:52:37 EDT 1995



okay, so let's look at what we've got.

<radish: 18> pwd
/radish/070/projwisa
<radish: 19> ls
total 27
-rw-r--r--  1 kknowles     1354 Aug 23 15:52 cff-log
-rw-r--r--  1 kknowles        0 Aug 23 15:52 child-parent-pairs-man-id
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-qq-qu
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-qq-qu-uq
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-qq-qu-uq-uu
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-qq-qu-uq-uu-S
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-quotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-quotq-unquot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-subj
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-unquotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 cprs-unquotq-unquot
-rwxr-xr-x  1 kknowles     4301 Aug 23 15:50 create_full_files*
-rwxr-xr-x  1 kknowles     4303 Aug 23 15:48 create_full_files~
-rw-r--r--  1 kknowles        0 Aug 23 15:52 full-parent-rank-quotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 full-parent-rank-quotq-unquot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 full-parent-rank-subj
-rw-r--r--  1 kknowles        0 Aug 23 15:52 full-parent-rank-unquotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 full-parent-rank-unquotq-unquot
drwxr-xr-x  2 kknowles     1024 Aug 23 15:42 messages/
-rw-r--r--  1 kknowles        0 Aug 23 15:52 num-retrieved-quotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 num-retrieved-quotq-unquot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 num-retrieved-subj
-rw-r--r--  1 kknowles        0 Aug 23 15:52 num-retrieved-unquotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 num-retrieved-unquotq-unquot
drwxr-xr-x  2 kknowles     1024 Aug 23 15:52 quotes/
-rw-r--r--  1 kknowles        0 Aug 23 15:52 rp-table-quotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 rp-table-quotq-unquot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 rp-table-subj
-rw-r--r--  1 kknowles        0 Aug 23 15:52 rp-table-unquotq-quot
-rw-r--r--  1 kknowles        0 Aug 23 15:52 rp-table-unquotq-unquot
drwxr-xr-x  2 kknowles     1024 Aug 23 15:52 subjects/



which means that basically, i think everything is working, only there
are no irt identifications, so there is no data to be had. we could,
however, run everything against everything else, old style, using
query-run-better.

<radish: 62> query-run-better quotes/ /radish/070/kim-email-dbases/projwisaunquot/spec 50 full-query-quotq-unquot

<radish: 64> query-run-better quotes/ /radish/070/kim-email-dbases/projwisaquot/spec 50 full-query-quotq-quot

which works just fine; so maybe with the digest case, we just want to look at this. we
can definitely work with the entire wisa archive, and maybe we can test on other

<radish: 117> create_full_files 2> cff-log
Wed Aug 23 18:02:23 EDT 1995
Wed Aug 23 18:02:24 EDT 1995
/radish/070/projwisa
Indexing docs at Wed Aug 23 18:02:25 EDT 1995
0.3u 2.4s 0:03 69% 0+0k 9+31io 2pf+0w
All done at Wed Aug 23 18:02:27 EDT 1995
Wed Aug 23 18:02:28 EDT 1995
/radish/070/projwisa
Indexing docs at Wed Aug 23 18:02:29 EDT 1995
0.6u 2.6s 0:04 79% 0+0k 5+2io 0pf+0w
All done at Wed Aug 23 18:02:32 EDT 1995
Wed Aug 23 18:02:32 EDT 1995
/radish/070/projwisa
Indexing docs at Wed Aug 23 18:02:33 EDT 1995
0.3u 2.5s 0:03 72% 0+0k 3+33io 0pf+0w
All done at Wed Aug 23 18:02:36 EDT 1995
Wed Aug 23 18:02:36 EDT 1995
Wed Aug 23 18:02:52 EDT 1995
Wed Aug 23 18:02:52 EDT 1995
Wed Aug 23 18:03:08 EDT 1995
Wed Aug 23 18:03:08 EDT 1995
Wed Aug 23 18:03:39 EDT 1995
Wed Aug 23 18:03:39 EDT 1995
Wed Aug 23 18:04:12 EDT 1995
Wed Aug 23 18:04:13 EDT 1995
Wed Aug 23 18:04:41 EDT 1995
Wed Aug 23 18:04:42 EDT 1995
Wed Aug 23 18:04:43 EDT 1995
Wed Aug 23 18:04:43 EDT 1995
Wed Aug 23 18:04:43 EDT 1995
Wed Aug 23 18:04:4? EDT 1995
Wed Aug 23 18:04:44 EDT 1995
Wed Aug 23 18:04:46 EDT 1995
<radish: 118> ls
total 231
-rw-r--r--  1 kknowles      696 Aug 23 18:04 cff-log
drwxr-xr-x  2 kknowles      512 Aug 23 18:04 cprs-concat/
-rw-r--r--  1 kknowles   106095 Aug 23 18:04 cprs-final
drwxr-xr-x  2 kknowles      512 Aug 23 18:04 cprs-origs/
-rwxr-xr-x  1 kknowles     4349 Aug 23 18:01 create_full_files*
drwxr-xr-x  2 kknowles      512 Aug 23 18:04 full-queries/
drwxr-xr-x  2 kknowles     1024 Aug 23 15:42 messages/
drwxr-xr-x  2 kknowles      512 Aug 23 18:04 num-rets/
drwxr-xr-x  2 kknowles     1024 Aug 23 18:02 quotes/
drwxr-xr-x  2 kknowles     1024 Aug 23 18:02 subjects/
drwxr-xr-x  2 kknowles     1024 Aug 23 18:02 unquotes/

so we're done.
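The run above leaves separate quotes/, unquotes/ and subjects/ directories, one derived file per message. A minimal sketch of how one message might be split into those three parts follows. It is an illustration only, not the appendix's own code: the function name split_message, the ">" quote prefix and the "headers end at the first blank line" rule are all assumptions.

```python
def split_message(text, quote_prefix=">"):
    """Split one message into (subject, quoted, unquoted) text.

    Assumptions (not taken from the source): headers end at the first
    blank line, and a quoted line is one whose first non-blank
    character sequence is `quote_prefix`. Real mail uses other quoting
    styles as well.
    """
    header, _, body = text.partition("\n\n")

    # Take the first Subject: header, if any.
    subject = ""
    for line in header.splitlines():
        if line.lower().startswith("subject:"):
            subject = line.split(":", 1)[1].strip()
            break

    quoted, unquoted = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(quote_prefix):
            # Drop the quote marker and keep the quoted text itself.
            quoted.append(stripped[len(quote_prefix):].strip())
        elif stripped:
            unquoted.append(stripped)
    return subject, "\n".join(quoted), "\n".join(unquoted)
```

Each of the three returned strings would then be written to its own file under subjects/, quotes/ and unquotes/ before indexing.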
# 28-May-96: Code which drives SMART to set up the database of
# quoted material.
#
# usage: make_text_uquot collection dbase

# set echo verbose
#

#set bin = /home/smart/src/bin
set bin = /home/lewis/public/sgi/bin

#set tlibdir = /home/smart/lib
set tlibdir = /home/lewis/public/src/smart/smart.11.0/lib

#set database = /home/smart/indexed_colls/text
#set database = /radish/070/kim-email-dbases/projmailsubj
#set database = /radish/070/kim-email-dbases/projmailquot
#set database = /radish/070/kim-email-dbases/projtestquot
#set database = /radish/070/kim-email-dbases/projtest10kquot
set database = $2

#set coll = $cwd
#set coll = /radish/070/projtrain/subjects/
#set coll = /radish/070/projtrain/quotes/
#set coll = /radish/070/projtest10k/quotes/
set coll = $1

# create the empty collection (destroying any existing collection)
/bin/rm -rf $database
mkdir $database

find $coll -type f -print > $database/doc_loc
# for some reason, the ls method didn't work...
#/bin/ls -1 /radish/070/projtrain/subjects/ > $database/doc_loc

cat > $database/spec << EOF1
## INFORMATION LOCATIONS
database $database
include_file $tlibdir/spec.default

## TEXT DOCDESC
#### GENERIC PREPARSER
pp.default_section_name w
pp.default_section_action copy

#### DESCRIPTION OF PARSE INPUT
index.num_sections 1
index.section.0.name w
index.section.0.method index.parse_sect.full
index.section.0.word.ctype 0
index.section.0.proper.ctype 0
#ctype.0.text_stop_file ""   # we want to maintain the stop word list for quoted material
title_section 0

#### DESCRIPTION OF FINAL VECTORS
num_ctypes 1

## ALTERATIONS OF STANDARD PARAMETERS
doc_weight nnc
query_weight ncc

## ALTERATIONS OF STANDARD PROCEDURES
EOF1

# index the collection
echo Indexing docs at `date`
$bin/smart index.doc $database/spec < $database/doc_loc
time

echo All done at `date`
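The script builds doc_loc with find rather than ls and notes only that "the ls method didn't work". One plausible reason, which is an assumption since the source does not say, is that `ls -1 dir` prints bare file names while find prints each path with its directory prefix, so only the find output can be resolved from a different working directory. The sketch below shows the same listing step; the function names list_documents and write_doc_loc are hypothetical, and the sorting is added for reproducibility (find itself makes no ordering promise).

```python
import os


def list_documents(coll):
    """Return the path of every regular file under `coll`, sorted.

    Roughly what `find $coll -type f -print` emits, except that
    os.path.isfile follows symbolic links and the result is sorted.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(coll):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):
                paths.append(full)
    return sorted(paths)


def write_doc_loc(coll, doc_loc_path):
    """Write one document path per line and return how many were written."""
    paths = list_documents(coll)
    with open(doc_loc_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)
```

By contrast, os.listdir(coll) returns only the bare names, which mirrors the `ls -1` output the script abandoned.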
# 28-May-96: Code which drives SMART to set up the database of
# subject lines.
#
# usage: make_text_subj collection dbase

# set echo verbose
#

#set bin = /home/smart/src/bin
set bin = /home/lewis/public/sgi/bin

#set tlibdir = /home/smart/lib
set tlibdir = /home/lewis/public/src/smart/smart.11.0/lib

#set database = /home/smart/indexed_colls/text
#set database = /radish/070/kim-email-dbases/projmailsubj
#set database = /radish/070/kim-email-dbases/projtestsubj
#set database = /radish/070/kim-email-dbases/projtest10ksubj
set database = $2

#set coll = $cwd
#set coll = /radish/070/projtrain/subjects/
#set coll = /radish/070/projtest10k/subjects/
set coll = $1

# create the empty collection (destroying any existing collection)
/bin/rm -rf $database
mkdir $database

find $coll -type f -print > $database/doc_loc
# for some reason, the ls method didn't work...
#/bin/ls -1 /radish/070/projtrain/subjects/ > $database/doc_loc

cat > $database/spec << EOF1
## INFORMATION LOCATIONS
database $database
include_file $tlibdir/spec.default

## TEXT DOCDESC
#### GENERIC PREPARSER
pp.default_section_name w
pp.default_section_action copy

#### DESCRIPTION OF PARSE INPUT
index.num_sections 1
index.section.0.name w
index.section.0.method index.parse_sect.full
index.section.0.word.ctype 0
index.section.0.proper.ctype 0
ctype.0.text_stop_file ""   # we don't want a stop word list on the subject
title_section 0

#### DESCRIPTION OF FINAL VECTORS
num_ctypes 1

## ALTERATIONS OF STANDARD PARAMETERS
doc_weight nnc
query_weight ncc

## ALTERATIONS OF STANDARD PROCEDURES
EOF1

# index the collection
echo Indexing docs at `date`
$bin/smart index.doc $database/spec < $database/doc_loc
time

echo All done at `date`
make_text_uquot    Fri May  2 09:31:01 1997

# 28-May-96: Code which drives SMART to set up the database of
# unquoted material.
#
# usage: make_text_uquot collection dbase

# set echo verbose
#

#set bin = /home/smart/src/bin
set bin = /home/lewis/public/sgi/bin

#set tlibdir = /home/smart/lib
set tlibdir = /home/lewis/public/src/smart/smart.11.0/lib

#set database = /home/smart/indexed_colls/text
#set database = /radish/070/kim-email-dbases/projmailsubj
#set database = /radish/070/kim-email-dbases/projmailunquot
#set database = /radish/070/kim-email-dbases/projtestunquot
#set database = /radish/070/kim-email-dbases/projtest10kunquot
set database = $2

#set coll = $cwd
#set coll = /radish/070/projtrain/subjects/
#set coll = /radish/070/projtrain/unquotes/
#set coll = /radish/070/projtest10k/unquotes/
set coll = $1

# create the empty collection (destroying any existing collection)
/bin/rm -rf $database
mkdir $database

find $coll -type f -print > $database/doc_loc
# for some reason, the ls method didn't work...
#/bin/ls -1 /radish/070/projtrain/subjects/ > $database/doc_loc

cat > $database/spec << EOF1
## INFORMATION LOCATIONS
database $database
include_file $tlibdir/spec.default

## TEXT DOCDESC
#### GENERIC PREPARSER
pp.default_section_name w
pp.default_section_action copy

#### DESCRIPTION OF PARSE INPUT
index.num_sections 1
index.section.0.name w
index.section.0.method index.parse_sect.full
index.section.0.word.ctype 0
index.section.0.proper.ctype 0
#ctype.0.text_stop_file ""   # we want to use the stop list
title_section 0

#### DESCRIPTION OF FINAL VECTORS
num_ctypes 1

## ALTERATIONS OF STANDARD PARAMETERS
doc_weight nnc
query_weight ncc

## ALTERATIONS OF STANDARD PROCEDURES
EOF1

# index the collection
echo Indexing docs at `date`
$bin/smart index.doc $database/spec < $database/doc_loc
time

echo All done at `date`
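The specs above set doc_weight to nnc. In SMART's three-letter weighting notation this is commonly read as raw term frequency, no collection-frequency factor, and cosine normalization, so the inner product of two such vectors is their cosine similarity. The sketch below illustrates that weighting and the "pick the best-scoring candidate" step. It is not SMART's implementation: whitespace tokenization, the function names, and the use of the same weighting on the query side (the specs give query_weight ncc, which is not interpreted here) are all assumptions.

```python
import math
from collections import Counter


def nnc_vector(text):
    """Raw term counts, no idf factor, scaled to unit Euclidean length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0.0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def inner_product(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def best_match(query_text, candidates):
    """Return (name, score) for the highest-scoring candidate, or None.

    `candidates` maps a message name to its text. Ties on score are
    broken by name, which is an arbitrary choice made for this sketch.
    """
    q = nnc_vector(query_text)
    scored = [(inner_product(q, nnc_vector(text)), name)
              for name, text in candidates.items()]
    if not scored:
        return None
    score, name = max(scored)
    return name, score
```

A caller looking for a parent would pass the quoted text of the child as query_text and the unquoted text of every other message as candidates, then compare the returned score against a threshold of its own choosing before accepting the match.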
What is Claimed is:

1. A method of determining from a plurality of 
messages a second message that is related to a first 
message, comprising the steps of: 

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. generating a set of filtered second message 
vectors by filtering each of the plurality of messages 
using a second message filter bank, said second message 
filter bank comprising at least one message filter;

c. determining for each of the set of filtered 
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. determining from each of the degrees of match 
which one of the plurality of messages is the second 
message.

2. The method according to claim 1, wherein the 
relationship of the second message to the first message is 
parent to child; 

wherein the first message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered. 

3. The method according to claim 1, wherein the
relationship of the second message to the first message is
child to parent;

wherein the first message filter bank comprises a
message filter that extracts an unquoted portion of the
message being filtered; and

wherein the second message filter bank comprises a
message filter that extracts a quoted portion of the
message being filtered.

4. The method according to claim 1, wherein the step
of determining the degree of match between the filtered
first message vector and the filtered second message vector
comprises use of a statistical information retrieval
function.

5. The method according to claim 1, wherein the step
of determining from each of the degrees of match which one
of the plurality of messages is the second message
comprises determining which one of each of the degrees of
match is the maximum value and selecting the message
corresponding to the determined maximum value.

6. The method according to claim 4, wherein the step
of determining the degree of match between the filtered
first message vector and the filtered second message
vectors further comprises combining a set of values
resulting from the statistical information retrieval
function to form a single value representative of the
degree of match.

7. The method according to claim 6, wherein the step
of determining from each of the degrees of match which one
of the plurality of messages is the second message
comprises determining which element of the tuple of values
representative of each of the degrees of match is the
maximum value, and selecting the message corresponding to
the determined maximum value.

8. The method according to claim 1, further 
comprising the step of if the first message is contained in 
the plurality of messages, removing the first message from 
the plurality of messages before filtering the plurality of 
messages using the second message filter bank. 

9. The method according to claim 1, further
comprising the step of verifying that the second message is 
related to the first message. 

10. The method according to claim 9, wherein the step 
of verifying that the second message is related to the 
first message includes determining whether the degree of 
match between the filtered first message vector and the 
filtered second message vector corresponding to the 
determined second message exceeds a threshold value. 

11. The method according to claim 1, further 
comprising the step of presenting a list including the 
first message, at least one of the plurality of messages, 
and the degree of match between the filtered first message 
vector and the filtered second message vector corresponding 
to the at least one of the plurality of messages. 

12. A method of determining from a plurality of 
messages whether a second message is related to a first 
message, comprising the steps of: 

a. generating a filtered first message vector by
filtering the first message using a first message filter
bank, said first message filter bank comprising at least
one message filter;
b. generating a set of filtered second message
vectors by filtering each of the plurality of messages 
using a second message filter bank, said second message 
filter bank comprising at least one message filter; 

c. determining for each of the set of filtered 
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. determining for each of the set of filtered 
second message vectors whether the degree of match between 
the filtered first message vector and the filtered second 
message vector exceeds a threshold value. 






13. A method of processing a plurality of messages 
that may be related to a first message, comprising the 
steps of:

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. generating a set of filtered second message
vectors by filtering each of the plurality of messages
using a second message filter bank, said second message
filter bank comprising at least one message filter;

c. determining for each of the set of filtered
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. presenting a list including the first
message, at least one of the plurality of messages, and the
degree of match between the filtered first message vector
and the filtered second message vector corresponding to the
at least one of the plurality of messages.

14. A method of determining a thread of related
messages from a plurality of messages, comprising the steps
of:

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. if the first message is contained in the 
plurality of messages, retrieving the first message from the 
plurality of messages; 

c. generating a set of filtered second message
vectors by filtering each of the plurality of messages
using a second message filter bank, said second message
filter bank comprising at least one message filter;

d. determining for each of the set of filtered 
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; 

e. determining from each of the degrees of match 
whether one of the plurality of messages is a second 
message related to the first message; and 

f. if it is determined that one of plurality of 
messages is a second message is related to the first 
message, substituting the second message in place of the 
first message and repeating each of the steps a through f
herein.

15. The method according to claim 14, wherein the
relationship of the second message to the first message is
parent to child;

wherein the first message filter bank comprises a
message filter that extracts a quoted portion of the
message being filtered; and

wherein the second message filter bank comprises a
message filter that extracts an unquoted portion of the
message being filtered.

16. The method according to claim 14, wherein the 
relationship of the second message to the first message is 
child to parent;

wherein the first message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered. 



17. The method according to claim 14, wherein the 
step of determining the degree of match between the 
filtered first message vector and the filtered second 
message vector comprises use of a statistical information 
retrieval function. 

18. The method according to claim 14, wherein the 
step of determining from each of the degrees of match which 
one of the plurality of messages is the second message 
comprises determining which one of each of the degrees of 
match is the maximum value and selecting the message 
corresponding to the determined maximum value. 

19. The method according to claim 17, wherein the
step of determining the degree of match between the
filtered first message vector and the filtered second
message vector further comprises combining a set of values
resulting from the statistical information retrieval
function to form a single value representative of the
degree of match.

20. The method according to claim 19, wherein the
step of determining from each of the degrees of match which
one of the plurality of messages is the second message
comprises determining which element of the vector
representative of each of the degrees of match is the
maximum value, and selecting the message corresponding to
the determined maximum value.

21. A method of determining a thread of related
messages from a plurality of messages, comprising the steps
of:

a. generating a set of filtered first message
vectors by filtering each of the plurality of messages
using a first message filter bank, said first message
filter bank comprising at least one message filter;

b. generating a set of filtered second message
vectors by filtering each of the plurality of messages
using a second message filter bank, said second message
filter bank comprising at least one message filter;

c. determining for each of the set of filtered
second message vectors the degree of match between each of
the filtered first message vectors and the filtered second
message vector;

d. determining from each of the degrees of match
each one of the plurality of messages that is related to
another of the plurality of messages; and

e. determining from each of the plurality of
messages that is related to another of the plurality of
messages a linked list of messages having successive
parent-child relationships.

22. A system for determining from a plurality of 
messages a second message that is related to a first 
message, comprising:

a. a processor; and

b. memory;

wherein said processor is programmed to execute the steps
of:

1. generating a filtered first message
vector by filtering the first message using a first message
filter bank, said first message filter bank comprising at
least one message filter;

2. generating a set of filtered second
message vectors by filtering each of the plurality of
messages using a second message filter bank, said second
message filter bank comprising at least one message filter;

3. determining for each of the set of
filtered second message vectors the degree of match between
the filtered first message vector and the filtered second
message vector; and

4. determining from each of the degrees of
match which one of the plurality of messages is the second
message.
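By way of non-limiting illustration, steps 1 through 4 of claim 22 might be sketched in Python as follows. A filter bank is modelled here as a list of functions, and applying the bank to a message yields one output per filter, which together form the filtered message vector. Representing messages as dictionaries with "subject" and "body" keys, scoring by averaged word overlap, and all function names are assumptions made for this example.

```python
def apply_bank(bank, message):
    """Apply every filter in the bank; the outputs form the filtered vector."""
    return [message_filter(message) for message_filter in bank]

def word_set(text):
    return set(text.lower().split())

def degree_of_match(first_vec, second_vec):
    # Assumed scoring: average Jaccard overlap over paired filter outputs.
    scores = []
    for a, b in zip(first_vec, second_vec):
        wa, wb = word_set(a), word_set(b)
        scores.append(len(wa & wb) / len(wa | wb) if wa and wb else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

def find_second_message(first, candidates, first_bank, second_bank):
    """Return (index of best candidate, all degrees of match)."""
    if not candidates:
        return None, []
    first_vec = apply_bank(first_bank, first)                       # step 1
    second_vecs = [apply_bank(second_bank, c) for c in candidates]  # step 2
    degrees = [degree_of_match(first_vec, v) for v in second_vecs]  # step 3
    best = max(range(len(degrees)), key=degrees.__getitem__)        # step 4
    return best, degrees
```

The two banks are passed separately so that different filters can be applied to the first message and to the candidates.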

23. The system according to claim 22, wherein the 
relationship of the second message to the first message is
parent to child;

wherein the first message filter bank comprises a
message filter that extracts a quoted portion of the
message being filtered; and

wherein the second message filter bank comprises a
message filter that extracts an unquoted portion of the
message being filtered.

24. The system according to claim 22, wherein the 
relationship of the second message to the first message is
child to parent;

wherein the first message filter bank comprises a
message filter that extracts an unquoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered. 
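By way of non-limiting illustration, the quoted-portion and unquoted-portion filters recited in claims 23 and 24 might be sketched in Python as follows. The sketch assumes that a quoted line is one beginning with ">" after optional leading whitespace. Mail programs use many other quoting conventions, so the prefix is a parameter, and the function name is hypothetical.

```python
def split_quoted(body, prefix=">"):
    """Split a message body into (quoted text, unquoted text).

    Assumes a quoted line starts with `prefix` after optional leading
    whitespace; blank lines are dropped from both outputs.
    """
    quoted, unquoted = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(prefix):
            # Remove the quote markers, including nested ones such as "> > ".
            quoted.append(stripped.lstrip(prefix + " ").rstrip())
        elif stripped:
            unquoted.append(stripped.rstrip())
    return "\n".join(quoted), "\n".join(unquoted)
```

To look for a parent, the quoted text of the reply would be compared with the unquoted text of each candidate. To look for a child, the roles of the two outputs are exchanged.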

25. The system according to claim 22, wherein the 
step of determining the degree of match between the 
filtered first message vector and the filtered second 
message vector comprises use of a statistical information 
retrieval function.
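By way of non-limiting illustration, one well-known statistical information retrieval function is cosine similarity over TF-IDF term weights. The Python sketch below uses it as a stand-in. The claim does not name a particular function, so the choice of TF-IDF, the smoothed inverse document frequency formula, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter

def tokenize(text):
    # Lower-case and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n = len(texts)
    # Number of texts in which each term occurs at least once.
    doc_freq = Counter(term for toks in token_lists for term in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append({
            term: tf * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Terms that occur in few messages receive higher weights, so a shared rare word contributes more to the degree of match than a shared common word.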

26. The system according to claim 22, wherein the 
step of determining from each of the degrees of match which 
one of the plurality of messages is the second message 
comprises determining which one of each of the degrees of 
match is the maximum value and selecting the message 
corresponding to the determined maximum value.

27. The system according to claim 25, wherein the
step of determining the degree of match between the 
filtered first message vector and the filtered second 
message vector further comprises combining a set of values 
resulting from the statistical information retrieval 
function to form a single value representative of the 
degree of match. 

28. The system according to claim 27, wherein the 
step of determining from each of the degrees of match which 
one of the plurality of messages is the second message
comprises determining which element of the tuple of values
representative of each of the degrees of match is the
maximum value, and selecting the message corresponding to
the determined maximum value.

29. The system according to claim 22, further 
comprising the step of if the first message is contained in 
the plurality of messages, removing the first message from
the plurality of messages before filtering the plurality of 
messages using the second message filter bank. 

30. The system according to claim 22, further
comprising the step of verifying that the second message is
related to the first message. 



31. The system according to claim 30, wherein the 
step of verifying that the second message is related to the
first message includes determining whether the degree of
match between the filtered first message vector and the 
filtered second message vector corresponding to the 
determined second message exceeds a threshold value. 
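By way of non-limiting illustration, the verification recited in claims 30 and 31 might be sketched in Python as follows. The best-scoring candidate is selected and is then accepted only if its degree of match exceeds the threshold. The function name and the convention of returning None when no candidate passes are assumptions made for this example.

```python
def select_and_verify(degrees, threshold):
    """Return the index of the best match, or None if it fails verification."""
    if not degrees:
        return None
    best = max(range(len(degrees)), key=degrees.__getitem__)
    # Claim 31 says "exceeds", so a score equal to the threshold is rejected.
    return best if degrees[best] > threshold else None
```

Returning None lets a caller distinguish a message with no plausible relative from one whose best candidate is merely weak.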

32. The system according to claim 22, further
comprising the step of presenting a list including the
first message, at least one of the plurality of messages,
and the degree of match between the filtered first message
vector and the filtered second message vector corresponding
to the at least one of the plurality of messages.
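By way of non-limiting illustration, the presenting step of claim 32 might be sketched in Python as follows. The candidates are listed under the first message in descending order of degree of match. The message identifiers, the text layout, the ordering, and the `limit` parameter are assumptions made for this example.

```python
def format_matches(first_id, candidate_ids, degrees, limit=5):
    """Build a plain-text list of the first message and its top candidates."""
    # Pair each degree with its candidate and sort with the best match first.
    rows = sorted(zip(degrees, candidate_ids), reverse=True)[:limit]
    lines = [f"Matches for {first_id}:"]
    lines += [f"  {cid}  {deg:.2f}" for deg, cid in rows]
    return "\n".join(lines)
```

A list of this kind lets the user confirm or reject a proposed relationship rather than relying on the highest score alone.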

33. A system for determining from a plurality of 
messages whether a second message is related to a first 
message, comprising the steps of:

a. generating a filtered first message vector by
filtering the first message using a first message filter
bank, said first message filter bank comprising at least
one message filter;

b. generating a set of filtered second message
vectors by filtering each of the plurality of messages 
using a second message filter bank, said second message 
filter bank comprising at least one message filter; 

c. determining for each of the set of filtered 
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. determining for each of the set of filtered 
second message vectors whether the degree of match between 
the filtered first message vector and the filtered second 
message vector exceeds a threshold value.

34. A system for processing a plurality of messages
that may be related to a first message, comprising the 
steps of: 

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. generating a set of filtered second message 
vectors by filtering each of the plurality of messages
using a second message filter bank, said second message 
filter bank comprising at least one message filter; 

c. determining for each of the set of filtered 
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. presenting a list including the first 
message, at least one of the plurality of messages, and the 
degree of match between the filtered first message vector
and the filtered second message vector corresponding to the
at least one of the plurality of messages.

35. A system for determining a thread of related
messages from a plurality of messages, comprising: 

a. a processor; and 

b. memory; 

wherein said processor is programmed to execute the steps 
of:

1. generating a filtered first message
vector by filtering the first message using a first message
filter bank, said first message filter bank comprising at
least one message filter;

2. if the first message is contained in the
plurality of messages, removing the first message from the
plurality of messages;

3. generating a set of filtered second
message vectors by filtering each of the plurality of
messages using a second message filter bank, said second
message filter bank comprising at least one message filter;

4. determining for each of the set of
filtered second message vectors the degree of match between
the filtered first message vector and the filtered second
message vector;

5. determining from each of the degrees of
match whether one of the plurality of messages is a second
message related to the first message; and

6. if it is determined that one of the
plurality of messages is a second message related to the
first message, substituting the second message in place of
the first message and repeating each of the steps 1 through
6 herein.
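By way of non-limiting illustration, the repeated substitution recited in claim 35 might be sketched in Python as follows. Starting from one message, the sketch finds its best-matching parent, substitutes that parent as the new first message, and repeats until no candidate exceeds the threshold. The ">" quote-prefix convention, the word-overlap score, the 0.5 threshold, and the function names are assumptions made for this example.

```python
def quoted(body):
    # Assumed first filter: text of lines that start with ">".
    return " ".join(l.lstrip("> ") for l in body.splitlines()
                    if l.lstrip().startswith(">"))

def unquoted(body):
    # Assumed second filter: text of all other non-empty lines.
    return " ".join(l.strip() for l in body.splitlines()
                    if l.strip() and not l.lstrip().startswith(">"))

def overlap(a, b):
    # Assumed degree of match: Jaccard overlap of lower-cased word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa and wb else 0.0

def trace_thread(first, messages, threshold=0.5):
    """Follow parent links from `first` back toward the start of its thread."""
    pool = list(messages)
    thread = [first]
    current = first
    while True:
        if current in pool:
            pool.remove(current)           # remove the first message
        if not pool:
            break
        first_text = quoted(current)       # filter the first message
        degrees = [overlap(first_text, unquoted(m)) for m in pool]
        best = max(range(len(degrees)), key=degrees.__getitem__)
        if degrees[best] <= threshold:     # no related message found
            break
        current = pool[best]               # substitute and repeat
        thread.append(current)
    return thread
```

Because each message that has served as the first message is removed from the pool before the next pass, the loop always terminates.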

36. The system according to claim 35, wherein the 
relationship of the second message to the first message is
parent to child;

wherein the first message filter bank comprises a
message filter that extracts a quoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered.

37. The system according to claim 35, wherein the 
relationship of the second message to the first message is 
child to parent; 

wherein the first message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered. 



38. The system according to claim 35, wherein the 
step of determining the degree of match between the 
filtered first message vector and the filtered second 
message vector comprises use of a statistical information 
retrieval function. 

39. The system according to claim 35, wherein the 
step of determining from each of the degrees of match which 
one of the plurality of messages is the second message 
comprises determining which one of each of the degrees of 
match is the maximum value and selecting the message 
corresponding to the determined maximum value. 

40. The system according to claim 38, wherein the 
step of determining the degree of match between the 
filtered first message vector and the filtered second
message vector further comprises combining a set of values
resulting from the statistical information retrieval
function to form a single value representative of the 
degree of match. 

41. The system according to claim 40, wherein the
step of determining from each of the degrees of match which
one of the plurality of messages is the second message
comprises determining which element of the vector
representative of each of the degrees of match is the
maximum value, and selecting the message corresponding to
the determined maximum value.

42. A system for determining a thread of related 
messages from a plurality of messages, comprising the steps 
of:

a. generating a set of filtered first message
vectors by filtering each of the plurality of messages
using a first message filter bank, said first message
filter bank comprising at least one message filter;

b. generating a set of filtered second message
vectors by filtering each of the plurality of messages
using a second message filter bank, said second message
filter bank comprising at least one message filter;

c. determining for each of the set of filtered
second message vectors the degree of match between each of
the filtered first message vectors and the filtered second
message vector;

d. determining from each of the degrees of match
each one of the plurality of messages that is related to
another of the plurality of messages; and

e. determining from each of the plurality of
messages that is related to another of the plurality of
messages a linked list of messages having successive
parent-child relationships.

43. An article of manufacture, comprising a
computer-readable medium having stored thereon instructions for
determining from a plurality of messages a second message 
that is related to a first message, said instructions 
which, when performed by a processor, cause the processor 
to execute the steps comprising the steps of: 

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. generating a set of filtered second message 
vectors by filtering each of the plurality of messages 
using a second message filter bank, said second message 
filter bank comprising at least one message filter; 

c. determining for each of the set of filtered 
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. determining from each of the degrees of match
which one of the plurality of messages is the second
message.

44. The article of manufacture according to claim 43, 
wherein the relationship of the second message to the first 
message is parent to child; 

wherein the first message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered.

45. The article of manufacture according to claim 43,
wherein the relationship of the second message to the first 
message is child to parent; 

wherein the first message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered.



46. The article of manufacture according to claim 43, 
wherein the step of determining the degree of match between 
the filtered first message vector and the filtered second 
message vector comprises use of a statistical information
retrieval function.

47. The article of manufacture according to claim 43, 
wherein the step of determining from each of the degrees of 
match which one of the plurality of messages is the second
message comprises determining which one of each of the
degrees of match is the maximum value and selecting the 
message corresponding to the determined maximum value. 

48. The article of manufacture according to claim 46,
wherein the step of determining the degree of match between
the filtered first message vector and the filtered second
message vector further comprises combining a set of values
resulting from the statistical information retrieval
function to form a single value representative of the
degree of match.

49. The article of manufacture according to claim 48, 
wherein the step of determining from each of the degrees of 
match which one of the plurality of messages is the second 
message comprises determining which element of the tuple of
values representative of each of the degrees of match is 
the maximum value, and selecting the message corresponding 
to the determined maximum value. 

50. The article of manufacture according to claim 43, 
further comprising the step of if the first message is 
contained in the plurality of messages, removing the first 
message from the plurality of messages before filtering the 
plurality of messages using the second message filter bank. 

51. The article of manufacture according to claim 43, 
further comprising the step of verifying that the second 
message is related to the first message. 

52. The article of manufacture according to claim 51, 
wherein the step of verifying that the second message is 
related to the first message includes determining whether 
the degree of match between the filtered first message
vector and the filtered second message vector corresponding 
to the determined second message exceeds a threshold value. 

53. The article of manufacture according to claim 43, 
further comprising the step of presenting a list including 
the first message, at least one of the plurality of 
messages, and the degree of match between the filtered 
first message vector and the filtered second message vector 
corresponding to the at least one of the plurality of 
messages.

54. An article of manufacture, comprising a
computer-readable medium having stored thereon instructions for
determining from a plurality of messages whether a second
message is related to a first message, said instructions
which, when performed by a processor, cause the processor
to execute the steps comprising the steps of:

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. generating a set of filtered second message
vectors by filtering each of the plurality of messages 
using a second message filter bank, said second message 
filter bank comprising at least one message filter; 

c. determining for each of the set of filtered
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. determining for each of the set of filtered 
second message vectors whether the degree of match between 
the filtered first message vector and the filtered second
message vector exceeds a threshold value. 

55. An article of manufacture comprising a computer- 
readable medium having stored thereon instructions for 
processing a plurality of messages that may be related to a
first message, said instructions which, when performed by a 
processor, cause the processor to execute the steps 
comprising the steps of: 

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. generating a set of filtered second message
vectors by filtering each of the plurality of messages
using a second message filter bank, said second message
filter bank comprising at least one message filter;

c. determining for each of the set of filtered
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector; and 

d. presenting a list including the first 
message, at least one of the plurality of messages, and the 
degree of match between the filtered first message vector 
and the filtered second message vector corresponding to the 
at least one of the plurality of messages. 

56. An article of manufacture, comprising a computer- 
readable medium having stored thereon instructions for 
determining a thread of related messages from a plurality 
of messages, said instructions which, when performed by a 
processor, cause the processor to execute the steps 
comprising the steps of:

a. generating a filtered first message vector by 
filtering the first message using a first message filter 
bank, said first message filter bank comprising at least 
one message filter; 

b. if the first message is contained in the 
plurality of messages, removing the first message from the 
plurality of messages; 

c. generating a set of filtered second message 
vectors by filtering each of the plurality of messages 
using a second message filter bank, said second message 
filter bank comprising at least one message filter; 

d. determining for each of the set of filtered 
second message vectors the degree of match between the 
filtered first message vector and the filtered second 
message vector;

e. determining from each of the degrees of match 
whether one of the plurality of messages is a second 
message related to the first message; and

f. if it is determined that one of the plurality of
messages is a second message related to the first
message, substituting the second message in place of the
first message and repeating each of the steps a through f
herein.

57. The article of manufacture according to claim 56, 
wherein the relationship of the second message to the first 
message is parent to child; 

wherein the first message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered. 

58. The article of manufacture according to claim 56, 
wherein the relationship of the second message to the first 
message is child to parent; 

wherein the first message filter bank comprises a 
message filter that extracts an unquoted portion of the 
message being filtered; and 

wherein the second message filter bank comprises a 
message filter that extracts a quoted portion of the 
message being filtered. 

59. The article of manufacture according to claim 56, 
wherein the step of determining the degree of match between 
the filtered first message vector and the filtered second 
message vector comprises use of a statistical information 
retrieval function. 

60. The article of manufacture according to claim 56,
wherein the step of determining from each of the degrees of
match which one of the plurality of messages is the second
message comprises determining which one of each of the
degrees of match is the maximum value and selecting the
message corresponding to the determined maximum value.

61. The article of manufacture according to claim 59, 
wherein the step of determining the degree of match between 
the filtered first message vector and the filtered second 
message vector further comprises combining a set of values 
resulting from the statistical information retrieval
function to form a single value representative of the 
degree of match. 



62. The article of manufacture according to claim 61, 
wherein the step of determining from each of the degrees of
match which one of the plurality of messages is the second
message comprises determining which element of the vector
representative of each of the degrees of match is the
maximum value, and selecting the message corresponding to
the determined maximum value.

63. An article of manufacture comprising a computer- 
readable medium having stored thereon instructions for 
determining a thread of related messages from a plurality
of messages, said instructions which, when performed by a
processor, cause the processor to execute the steps
comprising the steps of:

a. generating a set of filtered first message
vectors by filtering each of the plurality of messages
using a first message filter bank, said first message
filter bank comprising at least one message filter;

b. generating a set of filtered second message
vectors by filtering each of the plurality of messages
using a second message filter bank, said second message
filter bank comprising at least one message filter;

c. determining for each of the set of filtered 
second message vectors the degree of match between each of 
the filtered first message vectors and the filtered second 
message vector; 

d. determining from each of the degrees of match 
each one of the plurality of messages that is related to 
another of the plurality of messages; and 

e. determining from each of the plurality of 
messages that is related to another of the plurality of 
messages a linked list of messages having successive 
parent-child relationships.